# Week 9 NLP - Final Project Part 2
Lauren Madar
October 25 2022

# 1. A Web Application for Drug Adverse Event Recognition in Medical Text using NLP (Lauren Madar)

In [93]:
import pandas as pd
import nltk
import re
import numpy as np
from bs4 import BeautifulSoup as bs
from collections import Counter
import text_normalizer as tn 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")



# 2. Preprocessing steps

## Medications / Drugs

Build a dictionary of medications using the OpenFDA NDC database (https://open.fda.gov/apis/drug/ndc/download/). The downloaded files are <strong>package.txt</strong> and <strong>product.txt</strong> in the <strong>./data/</strong> directory.

In [26]:
drug_df = pd.read_csv('./data/product.txt', sep='\t', encoding='latin1')
drug_df.info()

package_df =  pd.read_csv('./data/package.txt', sep='\t', encoding='latin1')
package_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115633 entries, 0 to 115632
Data columns (total 20 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   PRODUCTID                         115633 non-null  object 
 1   PRODUCTNDC                        115633 non-null  object 
 2   PRODUCTTYPENAME                   115633 non-null  object 
 3   PROPRIETARYNAME                   115618 non-null  object 
 4   PROPRIETARYNAMESUFFIX             11136 non-null   object 
 5   NONPROPRIETARYNAME                115629 non-null  object 
 6   DOSAGEFORMNAME                    115633 non-null  object 
 7   ROUTENAME                         113551 non-null  object 
 8   STARTMARKETINGDATE                115633 non-null  int64  
 9   ENDMARKETINGDATE                  3998 non-null    float64
 10  MARKETINGCATEGORYNAME             115633 non-null  object 
 11  APPLICATIONNUMBER                 99860 non-null   o

Let's ignore the package data for now. It seems to list all of the different types of packs that the drug products are sold in, which isn't important here because we just want to match a drug's brand name or generic name.

In [27]:
drug_df.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,0002-0800_b02ed630-6947-431a-a8c8-227571403941,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,19870710,,BLA,BLA018781,Eli Lilly and Company,WATER,1.0,mL/mL,,,N,20231231.0
1,0002-1200_480fceef-6596-4478-97de-677c155506b3,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,20120601,,NDA,NDA202008,Eli Lilly and Company,FLORBETAPIR F-18,51.0,mCi/mL,"Positron Emitting Activity [MoA], Radioactive ...",,N,20221231.0
2,0002-1210_d03b2693-0231-4df4-a037-63017a42e85a,0002-1210,HUMAN PRESCRIPTION DRUG,TAUVID,,Flortaucipir F-18,"INJECTION, SOLUTION",INTRAVENOUS,20200528,,NDA,NDA212123,Eli Lilly and Company,FLORTAUCIPIR F-18,51.0,mCi/mL,,,N,20231231.0
3,0002-1220_d03b2693-0231-4df4-a037-63017a42e85a,0002-1220,HUMAN PRESCRIPTION DRUG,TAUVID,,Flortaucipir F-18,"INJECTION, SOLUTION",INTRAVENOUS,20220701,,NDA,NDA212123,Eli Lilly and Company,FLORTAUCIPIR F-18,100.0,mCi/mL,,,N,20231231.0
4,0002-1433_02015dc8-99de-47b3-b543-5b2a6db567b7,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,20140918,,BLA,BLA125469,Eli Lilly and Company,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC], Glucagon-Like Pe...",,N,20231231.0


PRODUCTNDC is an identifier used to mark each drug at a high level. That will be useful. So will the Product Type Name, which can tell us if it's an over-the-counter or prescription-only drug.

Proprietary Name Suffix, start and end marketing dates, DEA Schedule, NDC Exclude Flag, and Listing Record Certified Through are not relevant.

Now, let's do some cleanup - discard/drop some columns that we're going to ignore, and lowercase EVERYTHING.

In [28]:
drug_df.drop(['PROPRIETARYNAMESUFFIX', 'STARTMARKETINGDATE', 'ENDMARKETINGDATE', 'DEASCHEDULE', 'NDC_EXCLUDE_FLAG', 'LISTING_RECORD_CERTIFIED_THROUGH'], axis=1, inplace=True)

In [29]:
# Lowercase every string cell; non-string values (numbers, NaN) pass through unchanged.
drug_df = drug_df.applymap(lambda s: s.lower() if isinstance(s, str) else s)
drug_df.columns = drug_df.columns.str.lower()

In [30]:
drug_df.head()

Unnamed: 0,productid,productndc,producttypename,proprietaryname,nonproprietaryname,dosageformname,routename,marketingcategoryname,applicationnumber,labelername,substancename,active_numerator_strength,active_ingred_unit,pharm_classes
0,0002-0800_b02ed630-6947-431a-a8c8-227571403941,0002-0800,human otc drug,sterile diluent,diluent,"injection, solution",subcutaneous,bla,bla018781,eli lilly and company,water,1.0,ml/ml,
1,0002-1200_480fceef-6596-4478-97de-677c155506b3,0002-1200,human prescription drug,amyvid,florbetapir f 18,"injection, solution",intravenous,nda,nda202008,eli lilly and company,florbetapir f-18,51.0,mci/ml,"positron emitting activity [moa], radioactive ..."
2,0002-1210_d03b2693-0231-4df4-a037-63017a42e85a,0002-1210,human prescription drug,tauvid,flortaucipir f-18,"injection, solution",intravenous,nda,nda212123,eli lilly and company,flortaucipir f-18,51.0,mci/ml,
3,0002-1220_d03b2693-0231-4df4-a037-63017a42e85a,0002-1220,human prescription drug,tauvid,flortaucipir f-18,"injection, solution",intravenous,nda,nda212123,eli lilly and company,flortaucipir f-18,100.0,mci/ml,
4,0002-1433_02015dc8-99de-47b3-b543-5b2a6db567b7,0002-1433,human prescription drug,trulicity,dulaglutide,"injection, solution",subcutaneous,bla,bla125469,eli lilly and company,dulaglutide,0.75,mg/.5ml,"glp-1 receptor agonist [epc], glucagon-like pe..."


## Symptoms

Build a dictionary of symptoms using the symptoms category of the Diseases Database http://www.diseasesdatabase.com/item_relationships.asp?glngUserChoice=28856&bytRel=0&strBB=RL&Key={37CC62D8-7997-4047-881A-265927C6A827} which was saved as SymptomsIndex_DiseasesDB.html on October 24, 2022.

Example snippet from HTML file:
```html
<li><strong><a href="http://www.diseasesdatabase.com/ddb30819.htm" rel="nofollow">Abdominal distension</a></strong></li>
<li><strong><a href="http://www.diseasesdatabase.com/ddb14326.htm" rel="nofollow">Abdominal mass</a></strong></li>
<li><strong><a href="http://www.diseasesdatabase.com/ddb14367.htm" rel="nofollow">Abdominal pain</a></strong></li>
```
Look for list items (<strong>li</strong>) that contain links (<strong>a</strong>) wrapped in strong tags, and extract the info link (<strong>href</strong> attribute) and the symptom text.

In [31]:

# Read the saved HTML page; the context manager closes the file handle afterwards.
with open("./data/SymptomsIndex_DiseasesDB.html", "r") as html_file:
    symptomcode = html_file.read()
soup = bs(symptomcode, 'lxml')
  

symptomlist_html = soup.select("li strong a")
symptomlist_parsed = []
symptomlist_urls = []
symptomlist_split = []
for li in symptomlist_html:
    symptomlist_parsed.append(li.get_text().lower())
    symptomlist_urls.append(li['href'])
    symptomlist_split.append(li.get_text().lower().split(' '))
      
# print(symptomlist_parsed)
# print(symptomlist_urls)

symptom_df = pd.DataFrame()
symptom_df['symptom'] = symptomlist_parsed
symptom_df['desc_url'] = symptomlist_urls
symptom_df['wordlist'] = symptomlist_split

symptom_df.head()

Unnamed: 0,symptom,desc_url,wordlist
0,10th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2858.htm,"[10th, cranial, nerve, disorder]"
1,11th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2859.htm,"[11th, cranial, nerve, disorder]"
2,12th cranial nerve disorder,http://www.diseasesdatabase.com/ddb29576.htm,"[12th, cranial, nerve, disorder]"
3,1st cranial nerve disorder,http://www.diseasesdatabase.com/ddb28966.htm,"[1st, cranial, nerve, disorder]"
4,3rd cranial nerve disorder,http://www.diseasesdatabase.com/ddb2861.htm,"[3rd, cranial, nerve, disorder]"


Note that some symptoms consist of <strong>more than one word</strong>. Physician notes may not include every word of a given symptom, so we should plan on splitting the symptoms into individual words. Some symptoms are also very similar (like the cranial nerve disorder items). This is why we split the symptom string into a <strong>wordlist</strong> column.
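One way to use the split words is to score how much of a symptom appears in a note, rather than requiring the full phrase. The snippet below is only a sketch of that idea: the function name `symptom_word_coverage` and the example note are made up for illustration and are not part of the notebook.

```python
def symptom_word_coverage(note_tokens, symptom_words):
    """Return the fraction of a symptom's words that occur in a note."""
    if not symptom_words:
        return 0.0
    note_vocab = set(note_tokens)
    hits = sum(1 for word in symptom_words if word in note_vocab)
    return hits / len(symptom_words)


# Hypothetical note text, already lowercased and split on whitespace.
note = "patient reports pain in the abdominal region".split()

full = symptom_word_coverage(note, ["abdominal", "pain"])     # both words present
partial = symptom_word_coverage(note, ["abdominal", "mass"])  # only one word present
print(full, partial)
```

A threshold on this score (for example, requiring every word, or at least half of them) would be a tuning choice for later iterations.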

We can also normalize the words in each symptom, which will help us match normalized text in the patient notes. The call below uses stemming; lemmatization is turned off.

In [32]:
norm_symptoms = tn.normalize_corpus(corpus=symptom_df['symptom'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=False, 
                                  text_stemming=True, special_char_removal=True, remove_digits=True, 
                                  stopword_removal=True)

symptom_df['norm_symptom'] = norm_symptoms
symptom_df.head()

Unnamed: 0,symptom,desc_url,wordlist,norm_symptom
0,10th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2858.htm,"[10th, cranial, nerve, disorder]",th cranial nerv disord
1,11th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2859.htm,"[11th, cranial, nerve, disorder]",th cranial nerv disord
2,12th cranial nerve disorder,http://www.diseasesdatabase.com/ddb29576.htm,"[12th, cranial, nerve, disorder]",th cranial nerv disord
3,1st cranial nerve disorder,http://www.diseasesdatabase.com/ddb28966.htm,"[1st, cranial, nerve, disorder]",st cranial nerv disord
4,3rd cranial nerve disorder,http://www.diseasesdatabase.com/ddb2861.htm,"[3rd, cranial, nerve, disorder]",rd cranial nerv disord


Let's remove some artifacts left from digit removal - like "st", "nd", "rd", "th" which occur on their own.

In [33]:
ordinal_endings = ["st", "nd", "rd", "th"]
norm_wordlist = []
for symp in symptom_df['norm_symptom']:
    wordlist = symp.split(" ")
    norm_wordlist_item = [w for w in wordlist if w not in ordinal_endings]
    norm_wordlist.append(norm_wordlist_item)

symptom_df["norm_wordlist"] = norm_wordlist
symptom_df.head()

Unnamed: 0,symptom,desc_url,wordlist,norm_symptom,norm_wordlist
0,10th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2858.htm,"[10th, cranial, nerve, disorder]",th cranial nerv disord,"[cranial, nerv, disord]"
1,11th cranial nerve disorder,http://www.diseasesdatabase.com/ddb2859.htm,"[11th, cranial, nerve, disorder]",th cranial nerv disord,"[cranial, nerv, disord]"
2,12th cranial nerve disorder,http://www.diseasesdatabase.com/ddb29576.htm,"[12th, cranial, nerve, disorder]",th cranial nerv disord,"[cranial, nerv, disord]"
3,1st cranial nerve disorder,http://www.diseasesdatabase.com/ddb28966.htm,"[1st, cranial, nerve, disorder]",st cranial nerv disord,"[cranial, nerv, disord]"
4,3rd cranial nerve disorder,http://www.diseasesdatabase.com/ddb2861.htm,"[3rd, cranial, nerve, disorder]",rd cranial nerv disord,"[cranial, nerv, disord]"


There will be at least a few duplicates from the cranial nerve disorder items which we should collapse down, but let's ignore that for now.
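When we do get to it, collapsing could be as simple as grouping the original symptom strings by their normalized word list. This is a rough sketch with a made-up helper name (`collapse_duplicates`) and a few hand-typed rows standing in for `symptom_df`; the stem shown for "abdominal pain" is assumed, not taken from the real output.

```python
def collapse_duplicates(symptoms, norm_wordlists):
    """Group original symptom strings by their normalized word list."""
    groups = {}
    for symptom, words in zip(symptoms, norm_wordlists):
        key = tuple(words)  # lists are unhashable, so use a tuple as the key
        groups.setdefault(key, []).append(symptom)
    return groups


# Hand-typed stand-ins for symptom_df['symptom'] and symptom_df['norm_wordlist'].
symptoms = ["10th cranial nerve disorder", "1st cranial nerve disorder", "abdominal pain"]
norm_wordlists = [["cranial", "nerv", "disord"], ["cranial", "nerv", "disord"], ["abdomin", "pain"]]

groups = collapse_duplicates(symptoms, norm_wordlists)
for key, originals in groups.items():
    print(" ".join(key), "<-", originals)
```

Keeping the original strings in each group means the description URLs could still be looked up per symptom later.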

## Import note documents
These will be the text corpora that we are going to parse and search for potential adverse events.

In [34]:
patnotes_df = pd.read_csv('./data/patient_notes.csv')
patnotes_df.info()
patnotes_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42146 entries, 0 to 42145
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   pn_num      42146 non-null  int64 
 1   case_num    42146 non-null  int64 
 2   pn_history  42146 non-null  object
dtypes: int64(2), object(1)
memory usage: 987.9+ KB


Unnamed: 0,pn_num,case_num,pn_history
0,0,0,"17-year-old male, has come to the student heal..."
1,1,0,17 yo male with recurrent palpitations for the...
2,2,0,Dillon Cleveland is a 17 y.o. male patient wit...
3,3,0,a 17 yo m c/o palpitation started 3 mos ago; \...
4,4,0,17yo male with no pmh here for evaluation of p...


In [35]:
mtsamples_df = pd.read_csv('./data/mtsamples.csv')
mtsamples_df.info()
mtsamples_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         4999 non-null   int64 
 1   description        4999 non-null   object
 2   medical_specialty  4999 non-null   object
 3   sample_name        4999 non-null   object
 4   transcription      4966 non-null   object
 5   keywords           3931 non-null   object
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


For these two datasets, it looks like the patient_notes items only have one interesting column, pn_history (which is patient history). There are over forty thousand records.

The second dataset is much smaller (almost five thousand records) but has more relevant columns to look at, like description, specialty, transcription, and keywords.

There are a few options here - we can merge all of the records into one dataframe and use fillna to fill in empty entries for the columns that the first dataset doesn't have.

Or, we can choose the first or second dataset and ignore the other for now.

Since time is currently limited for this assignment step, let's focus on the simpler-but-larger first dataset (patient_notes.csv) and ignore mtsamples.csv for now.
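For reference, the merge option mentioned above might look something like the following. This is a sketch on two tiny hand-made frames, not the real data; the shared `text` column and the `source` column are names invented here.

```python
import pandas as pd

# Tiny stand-ins for patnotes_df and mtsamples_df (hypothetical rows).
notes = pd.DataFrame({"pn_num": [0], "pn_history": ["17 yo male with palpitations"]})
samples = pd.DataFrame({
    "transcription": ["SUBJECTIVE: allergic rhinitis"],
    "keywords": ["allergy"],
})

# Map each dataset's free-text column onto a shared "text" column and stack them.
combined = pd.concat(
    [
        notes.rename(columns={"pn_history": "text"}).assign(source="patient_notes"),
        samples.rename(columns={"transcription": "text"}).assign(source="mtsamples"),
    ],
    ignore_index=True,
    sort=False,
)

# Columns that one dataset lacks come back as NaN; fill the string column with "".
combined["keywords"] = combined["keywords"].fillna("")
print(combined[["source", "text", "keywords"]])
```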

In [36]:
norm_patnotes = tn.normalize_corpus(corpus=patnotes_df['pn_history'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=False, 
                                  text_stemming=True, special_char_removal=True, remove_digits=True, 
                                  stopword_removal=True)



In [37]:
patnotes_df['norm_corpus'] = norm_patnotes

In [38]:
patnotes_df.head()

Unnamed: 0,pn_num,case_num,pn_history,norm_corpus
0,0,0,"17-year-old male, has come to the student heal...",yearold male ha come student health clinic com...
1,1,0,17 yo male with recurrent palpitations for the...,yo male recurr palpit past mo last min happen ...
2,2,0,Dillon Cleveland is a 17 y.o. male patient wit...,dillon cleveland male patient signific pmh pre...
3,3,0,a 17 yo m c/o palpitation started 3 mos ago; \...,yo co palpit start mo ago noth improv exacerb ...
4,4,0,17yo male with no pmh here for evaluation of p...,yo male pmh evalu palpitations state last mo h...


We should remove names from the normalized corpus if present, since this is private information. However, there was no time to do that in this iteration. In future iterations, we can use examples like those found here: https://stackoverflow.com/questions/20290870/improving-the-extraction-of-human-names-with-nltk or we can use spaCy.

# 3. Feature extraction
Create a vocabulary of medication names and symptoms. Identify whether any of the patient notes contain medications or symptoms.

In [71]:
med_vocab = [] 
for wordl in drug_df["nonproprietaryname"]:
    if isinstance(wordl, str):
        w = wordl.split(" ")
        med_vocab.extend(w)
    
for wordl in drug_df["proprietaryname"]:
    if isinstance(wordl, str):
        #w = wordl.split(" ") # Keep proprietary names as a whole term
        med_vocab.append(wordl)
    
med_vocab = [word for word in med_vocab if len(word) > 3]

med_vocab = list(dict.fromkeys(med_vocab)) #get rid of duplicates

med_vocab = tn.normalize_corpus(corpus=med_vocab, html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=False, 
                                  text_stemming=False, special_char_removal=True, remove_digits=True, 
                                  stopword_removal=True)


med_vocab = list(filter(None, med_vocab)) # get rid of empty strings

In [73]:
symptom_vocab = [] 
for wordl in symptom_df["norm_wordlist"]:
    symptom_vocab.extend(wordl)
    
symptom_vocab = list(dict.fromkeys(symptom_vocab)) #get rid of duplicates

# 4. Main functionality
Find phrases that contain a possible adverse event (medication plus a symptom).  Classify each document as "may contain an adverse event" or not, capture medication names and symptoms for each document.

Future improvements: predict likelihood and/or severity of the adverse event using sentiment analysis and similarity.

### Create spaCy Matcher

In [131]:
# Initialize matcher patterns, including the items in the med vocabulary and the symptom vocabulary.
med_sym_matcher = Matcher(nlp.vocab)

for med in med_vocab:
    pattern = [[{"TEXT": med}]]
    med_sym_matcher.add("MED_" + med, pattern)

for sym in symptom_vocab:
    pattern = [[{"TEXT": sym}]]
    med_sym_matcher.add("SYM_" + sym, pattern)

# try out the matchers...
doc = nlp("I am not sick but I may have taken valium once or twice when I had a headache and insomnia")

matches = med_sym_matcher(doc)
print(matches)

[(8897826599849812487, 9, 10), (7282835221074212695, 17, 18), (15459807811483235343, 19, 20)]
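One thing to keep in mind with these patterns: each dict in a spaCy Matcher pattern describes a single token, so a proprietary name that was kept as a whole term (like "sterile diluent") cannot match a one-dict `TEXT` pattern. A possible refinement, sketched below, is to emit one token spec per word. The helper name `term_to_token_pattern` is made up, and splitting on whitespace is an assumption that only approximates how spaCy tokenizes (hyphens and punctuation may be split differently).

```python
def term_to_token_pattern(term):
    """Build one {"LOWER": ...} token spec per whitespace-separated word."""
    return [{"LOWER": word} for word in term.lower().split()]


# A multi-word proprietary name from the drug table.
patterns = [term_to_token_pattern("sterile diluent")]
print(patterns)

# With the matcher from the cell above, this could be registered as:
#   med_sym_matcher.add("MED_sterile diluent", patterns)
```

Using `LOWER` instead of `TEXT` also makes the match case-insensitive, which matters if the matcher is ever run on text that has not been lowercased.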


In [134]:
def get_matched_terms(doc, matches):
    matched_terms = []
    for ma in matches:
        match_id, start, end = ma
        # Get the matched span
        matched_span = doc[start:end]
        matched_terms.append(matched_span.text)
    return matched_terms

In [135]:
get_matched_terms(doc, matches)

['valium', 'headache', 'insomnia']

### Run the Matcher on the patient notes!

In [None]:
results = []
result_terms = []
entries = patnotes_df['norm_corpus']

for patient_note in entries:
    pn_doc = nlp(patient_note)
    pn_matches = med_sym_matcher(pn_doc)
    pn_terms = get_matched_terms(pn_doc, pn_matches)
    results.append(len(pn_matches)>0)
    result_terms.append(pn_terms)


In [154]:
patnotes_df['matched'] = results
patnotes_df['matched_terms']= result_terms

patnotes_df.describe()


[True, True, True, True, True] [['male', 'male', 'health', 'heart', 'heart', 'treatment', 'heart', 'heart', 'chest', 'chest', 'chest', 'chest', 'thyroid'], ['male', 'male', 'palpit', 'day', 'light', 'pressur', 'chest', 'chest', 'breath', 'breath', 'diarrhea', 'heat', 'weight', 'weight', 'loss', 'concentrate'], ['male', 'male', 'heart', 'heart', 'ani', 'rest', 'chest', 'chest', 'pressur', 'pain', 'pain', 'chest', 'chest', 'chest', 'chest', 'pressure', 'short', 'short', 'breath', 'breath', 'heart', 'heart', 'short', 'short', 'breath', 'breath', 'chest', 'chest', 'pain', 'pain', 'anxiety', 'medication', 'thyroid', 'heart', 'heart', 'age', 'live', 'smoking', 'drug'], ['co', 'palpit', 'symptom', 'symptom', 'ani', 'day', 'pressur', 'fall', 'nausea', 'nausea', 'headache', 'abdomin', 'pain', 'pain', 'chang', 'urin', 'bowel', 'tremor', 'skin', 'skin', 'hair', 'hair', 'chang', 'none', 'none', 'none', 'thyroid', 'x', 'smoking', 'cage'], ['male', 'male', 'state', 'heart', 'heart', 'beat', 'chest',

Obviously there are a lot of refinements to be done. Splitting the symptoms into single words allows unintended matches like 'male'. There are also duplicate matches, which we could better represent as counts.
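As a quick illustration of the counts idea, `collections.Counter` (already imported at the top of the notebook) can turn a list of matched terms into term frequencies. The list below is a shortened, hand-copied sample in the style of the output above, not the full result.

```python
from collections import Counter

# Shortened sample of matched terms for one note.
matched_terms = ['male', 'male', 'heart', 'heart', 'chest', 'thyroid']

term_counts = Counter(matched_terms)
print(term_counts)
```

The same call could be applied to each row's `matched_terms` list to store counts alongside the notes.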

We also need to refine the match rules so that we can determine if at least one medication and at least one symptom have been matched, possibly with separate matchers or by looking at the match results more closely to see which pattern had matched.
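Because the rules were registered with `MED_` and `SYM_` prefixes, one option is to look up each match's rule name with `nlp.vocab.strings[match_id]` and sort the names by prefix. The sketch below works on plain rule-name strings so it can be tried without loading spaCy; the function names and the example rule names are hypothetical.

```python
def split_rule_names(rule_names):
    """Separate matched rule names into medication and symptom terms by prefix."""
    meds = sorted({name[len("MED_"):] for name in rule_names if name.startswith("MED_")})
    syms = sorted({name[len("SYM_"):] for name in rule_names if name.startswith("SYM_")})
    return meds, syms


def may_contain_adverse_event(rule_names):
    """Flag a note only when at least one medication AND one symptom matched."""
    meds, syms = split_rule_names(rule_names)
    return bool(meds) and bool(syms)


# Hypothetical rule names, as nlp.vocab.strings[match_id] would return them.
both = ["MED_valium", "SYM_insomnia", "SYM_insomnia"]
symptoms_only = ["SYM_pain", "SYM_nausea"]

print(split_rule_names(both))
print(may_contain_adverse_event(both), may_contain_adverse_event(symptoms_only))
```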

# 5. Personal Contribution Statement - Individual Project
   
Certain aspects of this project took longer than I expected. While looking at different tools, I started trying to do train/test splits, but since this is a classification problem WITHOUT existing labels, what I started to do with TF-IDF was not helpful.

Once I started using spaCy instead, things went much more quickly, but I had run out of time to refine the pattern matching I was using and to fine-tune the symptom and drug name preprocessing.
    
Future improvements could include:
* Use an API to build the symptom and medication name dictionaries.
* Use medical codes to determine disease state from the documents to "home in" on specific symptoms or medications.
* Classify symptoms and medications to allow filtering or focus/efficiency. Extend this to the results to rate or predict the likelihood of a serious (highly negative sentiment) adverse event in a document.
* Determine a positive or negative sentiment for each document, and only concentrate on identifying adverse events in negative-sentiment documents. This logic might not be quite right, though, as a generally negative visit might not indicate an adverse reaction if a drug/medication was having the desired effect.

 