In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

from gensim import corpora, models, similarities

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sarap\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Finding the best matching research article entries for every patient profile

On the first glance, the problem seems to be the simple case of similarity calculation between patient profiles and relevant research articles. However, some features like, 

+ prescribed drug
+ genetic variants (rsIDs)

have more contribution towards final decision than features like,

+ diagnosis
+ anscestry
+ age
    
In case of supervised learning, the feature relevance (impact) calculation becomes simpler due to available techniques. This is more of a clustering problem, so normal feature relevance calculation is a moot point. Hence, the decision is based on the logic.

```
The research article entries could have hundreds of senarios with caucasian population suffering from type 2 diabetes. However, relevance of these entries will boost significantly if they are discussing the specific drug which is prescribed to the patient and also one of the genetic variants of the patient.
```

Hence, final decision was to build ensemble technique which primarily considers three features :
1. prescribed drug
2. genetic variants (rsIDs)
3. similarity measure based on diagnostic_information and generic_information (age, sex, ancestry)

In [2]:
patient_profile = 'D:/MedicalNLP/job-post-ai-patient-profiles.csv'
pat_prof_df = pd.read_csv(patient_profile, sep = '\t')
pat_prof_df.sample(3)

Unnamed: 0,Patient,Age,Sex,Ancestry,Diagnosis/Conditions,Drugs taken,rsIDs
2,C,37,male,caucasian,"HIV infection, type 2 diabetes",metformin,"rs10306114,rs1042522,rs10455872,rs1045642,rs10..."
1,B,43,male,black/african american,"epilepsy, seizures, stroke, nerve pain, asthma",metaproterenol,"rs10306114,rs1042522,rs10455872,rs1045642,rs10..."
0,A,54,female,asian,"ulcerative colitis,inflammatory bowel disease,...",paroxetine,"rs10306114,rs1042522,rs10455872,rs1045642,rs10..."


In [3]:
research_info = 'D:/MedicalNLP/job-post-ai-database-results.csv'
research_df = pd.read_csv(research_info, sep = '\t')
research_df.sample(5)

Unnamed: 0,Variant,Gene,Chemical,Phenotype Category,Significance,Notes,Sentence,Chromosome,Article Title,Article Abstract
64,rs2032582,ABCB1 (PA267),tacrolimus (PA451578),metabolism/PK,yes,3 - 12 months post-transplant. In multivariate...,Genotype AA is associated with decreased clear...,chr7,Relationship of CYP3A5 genotype and ABCB1 dipl...,Tacrolimus (TAC) is one of the most successful...
27,rs1045642,ABCB1 (PA267),midazolam (PA450496),"other,""metabolism/PK""",no,It's not clear exactly what genotype compariso...,Allele A is not associated with increased clea...,chr7,Induction of CYP3A4 by vinblastine: Role of th...,Several microtubule targeting agents are capab...
68,rs1045642,ABCB1 (PA267),cyclosporine (PA449167),efficacy,yes,The A allele was found to have a higher freque...,Allele A is associated with decreased response...,chr7,A pharmacogenetic study of ABCB1 polymorphisms...,"Psoriasis affects 2-3% of the population, caus..."
752,rs1800629,TNF (PA435),infliximab (PA452639),efficacy,yes,Good responders defined as patients whose dise...,Genotype GG is associated with increased respo...,chr6,Polymorphism at position -308 of the tumor nec...,To test whether the G-to-A polymorphism at pos...
1578,rs4149056,SLCO1B1 (PA134865839),"atorvastatin (PA448500),""simvastatin (PA451363)""",efficacy,no,"While, on average, total cholesterol, triglyce...",Genotype CC is not associated with increased r...,chr12,Lack of association between SLCO1B1 polymorphi...,There is significant inter-individual variabil...


# Readying the data


### Feature combination and elimination
#### Patient profiles
All the information given here is precise and significant. Hence, no elimination was performed. Although, feature combination was done combining features depicting diagnostic_information and generic_information (age, sex, ancestry).
#### Research articles
There is overwhelming amount of irrelevant information in this dataset. The patient profile does not discuss features like,
+ gene
+ chromosome

Also following features can be considered later while studying the relevant articles, as our similarity calculation is not semantic but BOW based
+ phenotype_catagory
+ significance 

Additional the feature, 
+ article abstract

consists of too much information containing lengthy genetic sequences, numeric values, units of measures (mg/DL, % etc.). In absence of sophisticated information extraction, this field worsens the final hypothesis. Because it adds a lot of irrelevant vocabulary in our BOW. 
Hence all these features are eliminated.

### Data clean-up

The data preparation / data clean-up is the crucial step when building ML solution. Especially, for NLP it becomes a huge responsibility as there is no fixed set of dos and don'ts. If one is using dictionary-based methods, it is natural to prune the data removing any irrelevant numbers, sequences, stop words, punctuations. But employing these techniques will be foolish when one used word embeddings or encoder (Neural Networks) based techniques, as everything hinges upon relative positions in the sentence and around other words. 

For this case, I decided to use traditional ML (BOW + tfidf) and gensim for relevance calculation. Hence I am using data clean-up techniques like punctuation + stop-word removal, lemmatization for easier correlation etc.


In [4]:
# using traditional NLP techniques like case altering, punctuation{ stopword} removal and lemmatization
# effective for BOW/dictionary based techniques but removes positional or relative information
# not recommended for deep learning based techniques like word vectors or encoders
stopset = stopwords.words('english')
translator = str.maketrans(string.punctuation, ' '*len(string.punctuation)) 
wordnet_lemmatizer = WordNetLemmatizer()
def clean_text(text_ex):

    #print(text_ex)
    #remove numbers but keep alphanumeric
    #text_ex = re.sub(r'\W+', ' ', text_ex)
    
    # removing digits
    #text_ex = re.sub(r"\d", "", text_ex)
    
    #removing single alphabets
    text_ex = re.sub(r"\s+[a-zA-Z]\s+", " ", text_ex)
    #text_ex = re.sub(r"\s+[a-zA-Z]+[a-zA-Z]\s+", " ", text_ex)
    
    text_ex = text_ex.lower()
    
    #map punctuation to space
    text_ex = text_ex.translate(translator)
    
    tokens = nltk.word_tokenize(text_ex)
    #stopword removal
    tokens = [w for w in tokens if w not in stopset]
    tokens = [wordnet_lemmatizer.lemmatize(w) for w in tokens]
    
    return " ".join(tokens)

def remove_regex(text_ex):

    #removing regex
    return re.sub(r"\spa\d+", " ", text_ex)
    

In [5]:
# data readying for patient profiles

patient_profile = 'D:/MedicalNLP/job-post-ai-patient-profiles.csv'
pat_prof_df = pd.read_csv(patient_profile, sep = '\t')

# finding specific age value in the data is challenging. However, found some articles discussing drug effects on adults and children
# Hence converting numerical to catagorical variable
pat_prof_df['Age'] = pat_prof_df['Age'].apply(lambda x: 'adult' if x>20 else 'child')

#combining interesting columns
pat_prof_df['profile'] = pat_prof_df.Age +" " + pat_prof_df.Sex + " " + pat_prof_df.Ancestry +" "+pat_prof_df['Diagnosis/Conditions'] + " " + pat_prof_df['Drugs taken']
pat_prof_df['drug'] = pat_prof_df['Drugs taken'] 
pat_prof_df['gen_variant'] = pat_prof_df['rsIDs']

# dropping insignificant data
pat_prof_df = pat_prof_df.drop(['Age', 'Sex', 'Ancestry', 'Diagnosis/Conditions', 'rsIDs', 'Drugs taken'], axis=1)

#removing punctuations, stopwords and lemmatizing the text
pat_prof_df['profile'] = pat_prof_df['profile'].apply(clean_text)
pat_prof_df['drug'] = pat_prof_df['drug'].apply(clean_text)
pat_prof_df['gen_variant'] = pat_prof_df['gen_variant'].apply(clean_text)


pat_prof_df.sample(3)

Unnamed: 0,Patient,profile,drug,gen_variant
1,B,adult male black african american epilepsy sei...,metaproterenol,rs10306114 rs1042522 rs10455872 rs1045642 rs10...
2,C,adult male caucasian hiv infection type 2 diab...,metformin,rs10306114 rs1042522 rs10455872 rs1045642 rs10...
0,A,adult female asian ulcerative colitis inflamma...,paroxetine,rs10306114 rs1042522 rs10455872 rs1045642 rs10...


In [6]:
# data readying for research articles

research_info = 'D:/MedicalNLP/job-post-ai-database-results.csv'
research_df = pd.read_csv(research_info, sep = '\t')

# taking care of missing values by adding stopwords in-place which will be removed later
research_df['Notes'] = research_df['Notes'].fillna('is of')
research_df['Article Abstract'] = research_df['Article Abstract'].fillna('is of')

#combining interesting columns
research_df['drug'] = research_df['Chemical']
research_df['gen_variant'] = research_df['Variant']
research_df['summary'] = research_df['Sentence']+" " +  research_df['Article Title']+ " "+ research_df['Notes']


# dropping insignificant data
research_df = research_df.drop(['Gene', 'Variant','Chemical', 'Phenotype Category', 'Significance', 'Notes', 'Sentence', 'Chromosome', 'Article Title', 'Article Abstract'], axis=1)


#removing punctuations, stopwords and lemmatizing the text
research_df['drug'] = research_df['drug'].apply(clean_text)
research_df['drug'] = research_df['drug'].apply(remove_regex)
research_df['summary'] = research_df['summary'].apply(clean_text)
research_df['gen_variant'] = research_df['gen_variant'].apply(clean_text)

research_df.sample(10)


Unnamed: 0,drug,gen_variant,summary
1935,warfarin,rs9923231,allele associated clearance warfarin compared ...
167,apixaban,rs1045642,genotype aa ag associated clearance apixaban p...
34,carbamazepine,rs1045642,allele associated increased dose carbamazepine...
790,folic acid hydroxychloroquine methotrexate ...,rs1801133,allele associated response folic acid hydroxyc...
857,tipifarnib,rs2032582,allele associated increased metabolism tipifar...
1880,warfarin,rs9923231,allele associated decreased dose warfarin chil...
651,fentanyl,rs1799971,allele associated dose fentanyl people pain po...
583,opioid anesthetic general anesthetic volatil...,rs1799971,genotype gg associated increased response opio...
1317,nevirapine,rs28399499,genotype cc ct associated decreased clearance ...
277,tacrolimus,rs1045642,allele associated dose adjusted trough concent...


# Constructing the model


### Model Selection
#### Similarity Calculation
Model to check if patient profile is relevant to corresponding research summary. After much pondering on which model to use for similarity calculation, I decided to do it with tfidf, gensim similarity measures. Besides considering that this is domain specific data, own vocabulary building using tfidf and BOW seems more sensible choice. .

Gensim similarity calculation is much more optimized for sparse matrix than custom similarity (Cosine or Jaccard) comparison that I could write. I have more control over the similarity calculation than black box clustering like LDA or LSA. 

#### Justification for not going for the kill with Keras and LSTM
+ Despite of the popular belief, I consider keeping the usage of LSTM based architectures as the last resort, especially for NLP. For most cases, traditional BOW techniques work fine according to my experience.

+ Using pretrained embeddings (wordVector, fastText, gloVe) or encoder weights (BERT) is not really an option due to highly domain specific nature of the data.

Note: Although I did find an article on clinicalBERT which might be of interest, if actual product needs to be designed and with no time constraints and relaxed research timeline
https://towardsdatascience.com/how-do-they-apply-bert-in-the-clinical-domain-49113a51be50

+ To train our own embeddings or encodings, it is expected to have large training set, which clearly, I don’t have.


### Ensemble Building
Like discussed earlier, decision behind building ensemble model originated from the logic that, features like, 

+ prescribed drug
+ genetic variants (rsIDs)

should have more contribution towards final decision than features like,

+ diagnosis
+ ancestry
+ age

which could be common for large number of people, but not all of them would take the prescribed drug.
Hence, model ensemble primarily considers three features, first two being the important ones:
1. prescribed drug
2. genetic variants (rsIDs)
3. similarity measure based on diagnostic_information and generic_information (age, sex, ancestry)


In [7]:
# building vocabulary for research articles
summary_data = research_df['summary'].values.tolist() 
summary_dict = corpora.Dictionary(d.split() for d in summary_data)
f2_cnt = len(summary_dict.token2id)
summary_corpus = [summary_dict.doc2bow(row.split()) for row in summary_data]
tfidf_summary = models.TfidfModel(summary_corpus)
summary_index = similarities.SparseMatrixSimilarity(tfidf_summary[summary_corpus], num_features = f2_cnt)


#Ensemble model building
profile_relevant_research = []
top_n = 50 # find top_n hypotheses besed on profile similairty
for ind in pat_prof_df.index:
    profile= pat_prof_df['profile'][ind]
    profile = summary_dict.doc2bow(profile.split())
    prof_sim = summary_index[tfidf_summary[profile]]

    # take top n similar documents and then eliminate from them
    top_similar = np.argpartition(prof_sim, -top_n)[-top_n:]
    
    print(top_similar)
    print(prof_sim[top_similar])
    
    drug = pat_prof_df['drug'][ind]
    gen_variant = pat_prof_df['gen_variant'][ind]
    top_similar_modified = []
    for ind_2 in top_similar:
        variant_flag = False
        drug_flag = False
        res_drug = research_df['drug'][ind_2]
        res_gen_variant = research_df['gen_variant'][ind_2]
        
        # note instead of this one on one finding one could use more sophisticated NER software or 
        # even similarity calculation building new drug based BOW, see below
        for var in gen_variant:
            if  res_gen_variant == var:
                variant_flag = True
                
        if drug in res_drug:
            drug_flag = True
        
        # instead of making it conditional we can allot importance to 3 features (0.3, 0.3, 0.4) or (0.4, 0.4, 02)
        # and calculate confidance score by softmaxing similarty_score*importance
        if drug_flag or variant_flag:
            top_similar_modified.append(ind_2)
            
    profile_relevant_research.append(top_similar_modified)
         
    print(top_similar_modified)
    print(prof_sim[top_similar_modified])

profile_relevant_research

[1167 1556  140  944 1310 1252  156   29   12 1650  736 1726  865   39
 1725 1561  892  874  308  738 1248  920 1144  363  217  104  400 1759
   97   98 1447  408  407 1039 1651  893  709 1654  364  164  403  622
 1775  253  742  989  956  398  409  401]
[0.07494742 0.07595357 0.07616989 0.07970971 0.0817286  0.08426116
 0.08506846 0.08607369 0.08613807 0.14872275 0.13482411 0.12041564
 0.08633952 0.12652117 0.11231668 0.09532007 0.12925307 0.1267301
 0.09304031 0.15460253 0.09087431 0.14103013 0.0907613  0.16103905
 0.09102055 0.09017017 0.1270077  0.1548422  0.13004579 0.10925093
 0.09528142 0.11974282 0.12053156 0.09084714 0.12890793 0.09567098
 0.176169   0.2367297  0.25122806 0.30487698 0.18405755 0.21298423
 0.21543376 0.3026088  0.28115982 0.2998783  0.3027596  0.18765494
 0.19303316 0.2339108 ]
[1726, 39, 874, 920, 1651, 1775]
[0.12041564 0.12652117 0.1267301  0.14103013 0.12890793 0.21543376]
[ 877 1161  878  685 1122  109   83 1252 1235  217 1039 1310  200   73
 1505 1447  85

[[1726, 39, 874, 920, 1651, 1775], [], [1796]]

In [8]:
#Performance demo
for ind in pat_prof_df.index:
    print("Patient profile : ",pat_prof_df['profile'][ind], )
    print("Drug : ",pat_prof_df['drug'][ind])
    relevant_research = profile_relevant_research[ind]
    print("Found %d relevant research articles" % len(relevant_research))
    for doc in relevant_research:
        print("Relavant document \n", research_df['summary'][doc])
    print("\n\n\n")
        
    

Patient profile :  adult female asian ulcerative colitis inflammatory bowel disease anxiety kidney transplant paroxetine
Drug :  paroxetine
Found 6 relevant research articles
Relavant document 
 allele associated response fluvoxamine paroxetine people depressive disorder major compared allele effect serotonin type 2a 3a 3b receptor serotonin transporter gene paroxetine fluvoxamine efficacy adverse drug reaction depressed japanese patient association found comparing responder v non responder
Relavant document 
 allele associated response paroxetine people depressive disorder major compared allele g pharmacogenetics modern psychiatry
Relavant document 
 allele associated response paroxetine people depressive disorder major compared allele c pharmacogenetics modern psychiatry
Relavant document 
 genotype cc ct associated decreased response paroxetine people depression compared genotype aa abcb1 mdr1 gene polymorphism associated clinical response paroxetine patient major depressive disorde

In [82]:
# drug based BOW experiments
drug_data = research_df['drug'].values.tolist() 
drug_dict = corpora.Dictionary(d.split() for d in drug_data)
f1_cnt = len(drug_dict.token2id)
drug_corpus = [drug_dict.doc2bow(row.split()) for row in drug_data]
tfidf_drug = models.TfidfModel(drug_corpus)
drug_index = similarities.SparseMatrixSimilarity(tfidf_drug[drug_corpus], num_features = f1_cnt)

    
text1 = pat_prof_df['drug'][0]

new_vec = drug_dict.doc2bow(text1.split())


sim = drug_index[tfidf_drug[new_vec]]
print(max(sim))
max_sim = 0
for i in range(len(sim)):
    if sim[i] > 0:
        print(text1, "research results ", research_df['drug'][i])
print("most similar text", max_sim)
print(len(sim))
print(text1)
print(research_df['drug'][max_sim])

1.0
paroxetine research results  paroxetine 
paroxetine research results  paroxetine 
paroxetine research results  paroxetine 
paroxetine research results  paroxetine 
paroxetine research results  fluvoxamine  milnacipran  paroxetine 
paroxetine research results  citalopram  fluoxetine  paroxetine 
paroxetine research results  citalopram  fluoxetine  paroxetine 
paroxetine research results  fluvoxamine  paroxetine 
paroxetine research results  paroxetine 
paroxetine research results  citalopram  fluoxetine  paroxetine 
most similar text 0
1997
paroxetine
latanoprost 


In [32]:
# summary based BOW experiments
summary_data = drug_info_df['summary'].values.tolist() 
summary_dict = corpora.Dictionary(d.split() for d in summary_data)
f2_cnt = len(summary_dict.token2id)
summary_corpus = [summary_dict.doc2bow(row.split()) for row in summary_data]
tfidf_summary = models.TfidfModel(summary_corpus)
summary_index = similarities.SparseMatrixSimilarity(tfidf_summary[summary_corpus], num_features = f2_cnt)
    
text1 = pat_prof_df['generic_info'][2]
text2 = pat_prof_df['diagnostic_info'][2]

new_vec = summary_dict.doc2bow(text2.split())

print(text2)
sim = summary_index[tfidf_summary[new_vec]]
i_2 = np.argpartition(sim, -15)[-15:] 
for ind_ in i_2:
    print(sim[ind_])
    print(drug_info_df['summary'][ind_])

hiv infection type 2 diabetes metformin
0.116386056
genotype del del associated increased response benazepril perindopril people diabetes mellitus compared genotype atacagtcactttttttttttttttgagacggagtctcgctctgtcgccc atacagtcactttttttttttttttgagacggagtctcgctctgtcgccc atacagtcactttttttttttttttgagacggagtctcgctctgtcgccc del ace dd genotype susceptible ace ii id genotype antiproteinuric effect ace inhibitor patient proteinuric non insulin dependent diabetes mellitus patient del del genotype significantly greater percentage reduction baseline 3 month treatment urinary excretion protein albumin compared remaining genotype indicates greater response antiproteinuric effect benazepril perindopril
0.11842383
genotype tt associated decreased metabolism efavirenz people hiv infection compared genotype gt high plasma efavirenz concentration cyp2b6 polymorphism thai hiv 1 infection significantly higher mid dose efv plasma concentration 12 week seen patient tt genotype compared patient gt genotype
0.1