Task: Finding adverse drug reactions (ADR), or side effects, of a drug from electronic health records using NLP.

In [None]:
import scispacy
import spacy                                                                
import en_core_sci_sm                                                       
from spacy import displacy                                                 
import pandas as pd

In [None]:
df = pd.read_csv("/content/assignment_data.csv")
df.head()

Unnamed: 0,SetID,Adverse Reactions,Summary
0,a834d1cf-72fc-93bf-e053-2995a90a6191,The following adverse events were observed and...,
1,a835b697-2beb-1ba8-e053-2995a90a470c,The following serious adverse reactions are de...,
2,a837f13e-fafc-0535-e053-2995a90a5070,ADVERSE REACTIONS Clinical Trials Experience I...,
3,a838204b-9564-9aa6-e053-2a95a90af02f,ADVERSE REACTIONS Clinical Trials Experience I...,
4,f265e6dd-f47e-4511-9468-282184bcd1b1,The most common adverse reactions leading to d...,


In [None]:
df2 = pd.read_csv("/content/example_output.csv")
df2.head()

Unnamed: 0,SetID,Adverse Reactions,Summary
0,632bb50c-3bcb-4c85-9056-fc33410550ae,The most common adverse reactions including la...,"Leukopenia lymphopenia, fatigue, anemia, neutr..."
1,723d9f78-9d77-4575-af27-1aa117e6b8d7,ADVERSE REACTIONS Adverse reactions to isosorb...,"Headache, lightheadedness in response to blood..."
2,8589d376-ac10-4ddb-9c53-2e0c8d5675c4,The most common adverse reactions (incidence 5...,"Instillation-site irritation, dysegeusia, decr..."
3,9087c92f-c753-4bd4-82e4-5aeee31e0ec3,Most common adverse reactions (>>10%): constip...,"constipation, nausea, and sedation."
4,a500b8db-fed5-7a0e-e053-2995a90ab877,Most common adverse reaction to amlodipine is ...,"Edema. Fatigue, nausea, abdominal pain, and s..."


Pre-processing

In [None]:
import re

# Cleaning
def clean(text):
    text = str(text)
    # removing paragraph numbers such as "1.<TAB>" (dot escaped to match a literal period)
    text = re.sub(r'[0-9]+\.\t', '', text)
    # removing new line characters that are followed by a space
    text = re.sub('\n ', '', text)
    # removing possessive 's
    text = re.sub("'s", '', text)
    # replacing hyphens with spaces and removing long dashes
    text = re.sub("-", ' ', text)
    text = re.sub("— ", '', text)
    # removing quotation marks and '>' signs
    text = re.sub('"', '', text)
    text = re.sub('>', '', text)
    # removing everything within (balanced) brackets
    text = re.sub(r'\([^)]*\)', '', text)

    return text


df['Adverse Reactions'] = df['Adverse Reactions'].apply(clean)
df2['Adverse Reactions'] = df2['Adverse Reactions'].apply(clean)
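
As a quick sanity check, the cleaning steps can be tried on a made-up string. The sketch below is not the notebook's `clean` function: `clean_sketch` and both sample strings are hypothetical, only three of the steps are reproduced, and the final whitespace-collapsing step is an added assumption that the original does not have.

```python
import re

def clean_sketch(text):
    """Illustrative subset of the cleaning steps (hypothetical helper)."""
    text = str(text)
    # drop paragraph numbers such as "1.<TAB>"; the dot is escaped so it
    # matches a literal period only
    text = re.sub(r'[0-9]+\.\t', '', text)
    # drop balanced parenthesised spans
    text = re.sub(r'\([^)]*\)', '', text)
    # turn hyphens into spaces
    text = text.replace('-', ' ')
    # assumption: collapse the double spaces left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "1.\tMost common adverse reactions (incidence 5%) are dose-related nausea"
cleaned = clean_sketch(sample)
print(cleaned)

# An opening bracket without a closing partner is not matched by the pattern
unbalanced = clean_sketch("vomiting and diarrhea. (")
print(unbalanced)
```

The second call suggests why a stray `(` can survive cleaning, as at the end of the sample text printed further down: the bracket pattern only removes pairs that are closed.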

 ---------------------------------------

Stopword removal is possible, but I have skipped it in order to preserve the contextual information that the parser used in the following steps needs.

--------------------------------------------------------------

In [None]:
text = df2['Adverse Reactions'][0]
text

'The most common adverse reactions including laboratory abnormalities  are leukopenia lymphopenia fatigue anemia neutropenia increased creatinine increased alanine aminotransferase increased glucose thrombocytopenia nausea decreased appetite musculoskeletal pain decreased albumin constipation dyspnea decreased sodium increased aspartate aminotransferase vomiting cough decreased magnesium and diarrhea. ('

In [None]:
#Tokenize the data into words using spacy
import spacy
from spacy import displacy 
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

### 1. Rule-based approach

Looking at the task, this seemed to me to be a Named Entity Recognition (NER) problem. My first instinct was to extract disease names (as proper nouns) and adjective+noun pairs (e.g. increased pain) using spaCy's dependency parser.


In [None]:
# After looking at the POS tags, this approach was devised
adjnounpairs = []
for token in doc:
    # adjectival modifier attached to a noun or a proper noun
    if token.dep_ == 'amod' and token.head.pos_ in ('NOUN', 'PROPN'):
        adjnounpairs.append([token, token.head])
    # proper nouns on their own
    if token.pos_ == 'PROPN':
        adjnounpairs.append([token])
print(adjnounpairs)

[[common, reactions], [adverse, reactions], [leukopenia], [lymphopenia], [fatigue], [anemia], [neutropenia], [creatinine], [increased, aminotransferase], [alanine], [glucose], [thrombocytopenia], [appetite], [musculoskeletal, pain], [dyspnea, sodium], [decreased, sodium], [aspartate, cough]]


As seen above, proper nouns are extracted correctly, but the adjective+noun pairs are not detected reliably because spaCy's dependency parser does not fit our data well. Look at the output of the dependency tree: the relationship between **decreased** and **appetite** is not captured. Hence, I moved on to pattern matching for adjective+noun pairs using spaCy's Matcher.

In [None]:
displacy.render(doc, style='dep',jupyter=True)

Using pattern matching :

In [None]:
from spacy.matcher import Matcher
# Note: matcher.add(key, None, pattern) is the spaCy v2 signature;
# in spaCy v3 the call is matcher.add(key, [pattern])

# For capturing patterns starting with increased
matcher = Matcher(nlp.vocab)
pattern1 = [{'LOWER': 'increased'},
            {'POS': 'PROPN'}]
matcher.add("matching_1", None, pattern1)
matches = matcher(doc)
# only the first match is printed; matches[0] raises IndexError if nothing matched
span = doc[matches[0][1]:matches[0][2]]
print(span.text)

# For capturing patterns starting with decreased
matcher = Matcher(nlp.vocab)
pattern2 = [{'LOWER': 'decreased'},
            {'POS': 'PROPN'}]
matcher.add("matching_2", None, pattern2)
matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]
print(span.text)


increased creatinine
decreased appetite


Overall, we can see that a rule-based approach does not scale as the vocabulary grows and the wording keeps changing (e.g. "increased creatinine" may appear as "reduced creatinine" in the future, in which case the pattern matching approach fails).
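
The brittleness can be seen in a few lines. The sketch below is a plain-regex stand-in for the Matcher patterns, not the spaCy code itself: it ignores POS tags, and the `MODIFIERS` tuple and the sample string are made up for illustration.

```python
import re

# Hypothetical modifier list: only the two words used in the patterns above
MODIFIERS = ("increased", "decreased")
pattern = re.compile(r"\b(?:" + "|".join(MODIFIERS) + r")\s+\w+")

sample = "increased creatinine decreased appetite reduced creatinine"
found = pattern.findall(sample)
print(found)  # "reduced creatinine" is not in the result
```

Every new modifier ("reduced", "elevated", "low") would have to be added by hand, which is the maintenance cost that motivates the learned approaches in the next sections.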

### 2. Supervised Machine Learning

In this approach, we can manually create a custom dictionary of common adverse drug reactions (ADR) from scratch (e.g. a dictionary that contains terms like musculoskeletal pain, fatigue, nausea etc.). Once we have this labelled training data, NER becomes a word classification problem where each word of the sentence has to be classified as ADR or non-ADR. We can use supervised machine learning models like SVM and CRF for this. (I haven't implemented this in the given time, but it can be thought of as an approach.)

These approaches require that we have a predefined set of output labels or named entities, which we do not have.
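
To make the word-classification framing concrete, here is a minimal sketch of how a term dictionary could be turned into per-token labels for training. Everything in it is an assumption for illustration: the `label_tokens` helper, the three-term dictionary, the whitespace tokenisation and the B-ADR/I-ADR/O tagging scheme (the text above only distinguishes ADR from non-ADR).

```python
def label_tokens(tokens, adr_terms):
    """Tag each token as B-ADR, I-ADR or O using a term dictionary."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for term in adr_terms:
        words = term.lower().split()
        n = len(words)
        # slide a window of the term's length over the sentence
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words:
                labels[i] = "B-ADR"          # first word of the term
                for j in range(i + 1, i + n):
                    labels[j] = "I-ADR"      # remaining words of the term
    return labels

adr_dictionary = ["musculoskeletal pain", "fatigue", "nausea"]
tokens = ("The most common adverse reactions are "
          "musculoskeletal pain fatigue and nausea").split()
labels = label_tokens(tokens, adr_dictionary)
for token, label in zip(tokens, labels):
    print(token, label)
```

The resulting (token, label) pairs are the kind of input a sequence classifier such as a CRF would be trained on.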

### 3. Named Entity Recognition using Sci-spacy

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. The following are entities extracted by the default model of scispacy **'en_core_sci_sm'**<br>
Reference : https://allenai.github.io/scispacy/

In [None]:
nlp = en_core_sci_sm.load()
doc = nlp(text)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')

As seen above, this model fails to capture desired entities like **cough** and captures some unwanted ones. Out of scispaCy's 7 biomedical models, the **en_ner_bc5cdr_md** model extracts DISEASE and CHEMICAL as entities. Since our use case is the extraction of adverse drug reactions, we keep the DISEASE entities but discard the CHEMICAL entities.

In [None]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz

In [None]:
import en_ner_bc5cdr_md
nlp = en_ner_bc5cdr_md.load()
doc = nlp(text)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')
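
Keeping DISEASE and discarding CHEMICAL comes down to a filter on the entity label. The sketch below assumes entity objects that expose `.text` and `.label_`, as spaCy spans do; to stay runnable without the model it uses a made-up `Ent` stand-in and a made-up entity list, and `keep_disease` is a hypothetical helper name.

```python
from collections import namedtuple

def keep_disease(entities):
    """Return the text of entities labelled DISEASE, dropping the rest."""
    return [ent.text for ent in entities if ent.label_ == "DISEASE"]

# Stand-in for spaCy spans so the sketch runs without the model;
# the entities below are invented for illustration
Ent = namedtuple("Ent", ["text", "label_"])
fake_ents = [
    Ent("leukopenia", "DISEASE"),
    Ent("creatinine", "CHEMICAL"),
    Ent("nausea", "DISEASE"),
]
diseases = keep_disease(fake_ents)
print(diseases)
```

In the notebook the same helper would be called as `keep_disease(doc.ents)` on the document produced by `en_ner_bc5cdr_md`.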

### 4. Using SparkNLP

In 2020, the developers at John Snow Labs gathered the available ADR datasets (PsyTAR, CADEC, Drug-AE, TwiMed), trained several Named Entity Recognition (NER) models in Spark NLP using BioBERT language models, and released them as pretrained models and pipelines with the Spark NLP Enterprise 2.6.2 release. I have used their pretrained model under a free license. This has given by far the most accurate results. <br>
Reference : https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

In [None]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

license_keys.keys()

Saving spark_nlp_for_healthcare.json to spark_nlp_for_healthcare.json


dict_keys(['SECRET', 'SPARK_NLP_LICENSE', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'JSL_VERSION', 'PUBLIC_VERSION'])

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

secret = license_keys['SECRET']

os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID']= license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
version = license_keys['PUBLIC_VERSION']
jsl_version = license_keys['JSL_VERSION']

! pip install --ignore-installed -q pyspark==2.4.4

! python -m pip install --upgrade spark-nlp-jsl==$jsl_version  --extra-index-url https://pypi.johnsnowlabs.com/$secret

! pip install --ignore-installed -q spark-nlp==$version

import sparknlp

print (sparknlp.version())

import json
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

spark = sparknlp_jsl.start(secret)

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)
  Building wheel for pyspark (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/2.7.0-e49fd2919a73690cd04ed3b6e223e21330c5214b
Collecting spark-nlp-jsl==2.7.0
  Downloading https://pypi.johnsnowlabs.com/2.7.0-e49fd2919a73690cd04ed3b6e223e21330c5214b/spark-nlp-jsl/spark_nlp_jsl-2.7.0-py3-none-any.whl (44kB)
Collecting spark-nlp==2.6.3
  Downloading https://files.pythonhosted.org/packages/84/84/3f15673db521fbc4e8e0ec3677a019ba1458b2cb70f0f7738c221511ef32/spark_nlp-2.6.3-py2.py3-none-any.whl (129kB)
Installing collecte

In [None]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

ade_ner = NerDLModel.pretrained("ner_ade_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

ner_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ade_ner,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")
print(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_ade_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
DataFrame[text: string]


In [None]:
ade_ner_model = ner_pipeline.fit(empty_data)

ade_ner_lp = LightPipeline(ade_ner_model)

In [None]:
light_result = ade_ner_lp.fullAnnotate(text)
print(light_result[0].keys())

dict_keys(['document', 'ner_chunk', 'token', 'ner', 'embeddings', 'sentence'])


In [None]:
light_result[0]['ner_chunk']

[Annotation(chunk, 74, 83, leukopenia, {'entity': 'ADE', 'sentence': '0', 'chunk': '0'}),
 Annotation(chunk, 85, 95, lymphopenia, {'entity': 'ADE', 'sentence': '0', 'chunk': '1'}),
 Annotation(chunk, 97, 103, fatigue, {'entity': 'ADE', 'sentence': '0', 'chunk': '2'}),
 Annotation(chunk, 105, 110, anemia, {'entity': 'ADE', 'sentence': '0', 'chunk': '3'}),
 Annotation(chunk, 112, 122, neutropenia, {'entity': 'ADE', 'sentence': '0', 'chunk': '4'}),
 Annotation(chunk, 124, 143, increased creatinine, {'entity': 'ADE', 'sentence': '0', 'chunk': '5'}),
 Annotation(chunk, 145, 161, increased alanine, {'entity': 'ADE', 'sentence': '0', 'chunk': '6'}),
 Annotation(chunk, 163, 178, aminotransferase, {'entity': 'ADE', 'sentence': '0', 'chunk': '7'}),
 Annotation(chunk, 180, 188, increased, {'entity': 'ADE', 'sentence': '0', 'chunk': '8'}),
 Annotation(chunk, 190, 196, glucose, {'entity': 'ADE', 'sentence': '0', 'chunk': '9'}),
 Annotation(chunk, 198, 213, thrombocytopenia, {'entity': 'ADE', 'sente

In [None]:
chunks = []
entities = []
begin = []
end = []

# collect text, label and character offsets of every detected chunk
for n in light_result[0]['ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

# a separate name keeps the input DataFrame `df` loaded earlier intact
ner_df = pd.DataFrame({'chunks': chunks, 'entities': entities,
                       'begin': begin, 'end': end})
ner_df

Unnamed: 0,chunks,entities,begin,end
0,leukopenia,ADE,74,83
1,lymphopenia,ADE,85,95
2,fatigue,ADE,97,103
3,anemia,ADE,105,110
4,neutropenia,ADE,112,122
5,increased creatinine,ADE,124,143
6,increased alanine,ADE,145,161
7,aminotransferase,ADE,163,178
8,increased,ADE,180,188
9,glucose,ADE,190,196
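
Assuming the goal is to fill the `Summary` column in the same comma-separated style as the example output, the chunk texts could be joined as sketched below. The `summarise` helper is hypothetical and the duplicate-skipping step is an added assumption; the input list is copied from the ten rows shown above.

```python
def summarise(chunks):
    """Join chunk texts into one comma-separated string, skipping duplicates."""
    seen = set()
    unique = []
    for chunk in chunks:
        text = chunk.strip()
        key = text.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return ", ".join(unique)

# The ten chunks listed in the table above
chunks = ["leukopenia", "lymphopenia", "fatigue", "anemia", "neutropenia",
          "increased creatinine", "increased alanine", "aminotransferase",
          "increased", "glucose"]
summary = summarise(chunks)
print(summary)
```

Note that the summary inherits the model's segmentation: fragments such as "increased alanine", "aminotransferase", "increased" and "glucose" appear as separate items because they were returned as separate chunks.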


Papers referred to: <br>
https://link.springer.com/article/10.1007%2Fs40264-018-0763-y <br>

https://kpfu.ru/staff_files/F123938974/Automated_Detection_of_Adverse_Drug_Reactions_From_Social_Media_Posts_With_Machine_Learning.pdf <br>

Based on these research papers, LSTMs as well as CNNs can also be used for this task.

