Ask for the definition of a medical term, and get a response that is reconciled between multiple web sources.  For an MVP, the web app will allow the user to enter 'heart attack' and get a machine generated reconciled definition from Wikipedia, WebMD, and MayoClinic.  Future iterations would include adding the ability to search more medical terms, using more sources with which to reconcile the requested definition, as well as including symptoms, causes, treatments, and risk factors.  Additional future iterations will include comparing sources with conflicting results with the established "training set" of sources.  These conflicting sources can be documented for additional scrutinization.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import matplotlib.pyplot as plt

In [2]:
#import numpy as np
#import pandas as pd
#import nltk
#import re
import os
import codecs
#from sklearn import feature_extraction
import mpld3

In [21]:
#word2vec packages
import gensim
from gensim import corpora, models, similarities
import logging
import os

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.


In [23]:
import numpy as np
from sklearn.metrics import jaccard_similarity_score
from scipy import spatial

Keywords to use when searching for the definition of a medical term:

'term' is

'term' occurs

###Wikipedia

In [4]:
response = requests.get('https://en.wikipedia.org/wiki/Myocardial_infarction')
page = response.text
soup = BeautifulSoup(page, 'html.parser')

In [5]:
paragraphs = soup.findAll('p')
k = 0
theindex = 0
foundone = False
for i in paragraphs:
    if 'heart attack' in paragraphs[k].text:
        if foundone == False:
            theindex = k
            foundone = True
    k += 1
wikidef = paragraphs[theindex].text
wikisentences = sent_tokenize(wikidef)
wikidef

u'Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs when blood flow stops to a part of the heart causing damage to the heart muscle. The most common symptom is chest pain or discomfort which may travel into the shoulder, arm, back, neck, or jaw. Often it is in the center or left side of the chest and lasts for more than a few minutes. The discomfort may occasionally feel like heartburn. Other symptoms may include shortness of breath, nausea, feeling faint, a cold sweat, or feeling tired.[1] About 30% of people have atypical symptoms,[2] with women more likely than men to present atypically.[3] Among those over 75 years old, about 5% have had an MI with little or no history of symptoms.[4] An MI may cause heart failure, an irregular heartbeat, or cardiac arrest.[5][6]'

###WebMD

In [6]:
response2 = requests.get('http://www.webmd.com/heart-disease/guide/heart-disease-heart-attacks')
page2 = response2.text
soup2 = BeautifulSoup(page2, 'html.parser')

In [7]:
paragraphs2 = soup2.findAll('p')
k = 0
theindex2 = 0
foundone2 = False
for i in paragraphs2:
    if 'heart attack' in paragraphs2[k].text:
        if foundone2 == False:
            theindex2 = k
            foundone2 = True
    k += 1
webmddef = paragraphs2[theindex2].text
webmdsentences = sent_tokenize(webmddef)

13

###MayoClinic

In [8]:
response3 = requests.get('http://www.mayoclinic.org/diseases-conditions/heart-attack/basics/definition/con-20019520')
page3 = response3.text
soup3 = BeautifulSoup(page3, 'html.parser')

In [9]:
paragraphs3 = soup3.findAll('p')
k = 0
theindex3 = 0
foundone3 = False
for i in paragraphs3:
    if 'heart attack' in paragraphs3[k].text:
        if foundone3 == False:
            theindex3 = k
            foundone3 = True
    k += 1
mayodef = paragraphs3[theindex3].text
mayosentences = sent_tokenize(mayodef)
mayosentences

[u'A heart attack occurs when the flow of blood to the heart is blocked, most often by a build-up of fat, cholesterol and other substances, which form a plaque in the arteries that feed the heart (coronary arteries).',
 u'The interrupted blood flow can damage or destroy part of the heart muscle.']

###Compare all three definitions

In [8]:
allthree = [wikidef, webmddef, mayodef]
vectorizer = CountVectorizer()
allthree = sorted(allthree)
X = vectorizer.fit_transform(allthree)
#print X
X_Array = X.toarray()
#X_Array
#analyze = vectorizer.build_analyzer()
#analyze(wikidef) == wikidef

In [9]:
#pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
comparedf = pd.DataFrame(X_Array, columns=vectorizer.get_feature_names())
comparedf = comparedf.transpose()
comparedf.columns = ['Wiki', 'WebMD', 'MayoClinic']
comparedf

Unnamed: 0,Wiki,WebMD,MayoClinic
30,0,0,1
75,0,0,1
about,0,0,2
acute,0,0,1
americans,0,1,0
ami,0,0,1
among,0,0,1
an,0,0,3
and,1,1,1
arm,0,0,1


In [11]:
#comparedf[comparedf['Wiki'] & comparedf['WebMD'] & comparedf['MayoClinic'] != 0]
comparedf[comparedf['Wiki'] & comparedf['WebMD'] != 0]


Unnamed: 0,Wiki,WebMD,MayoClinic
and,1,1,1
attack,1,1,1
damage,1,1,1
heart,4,4,4
is,1,1,2
of,3,2,5
or,1,1,7
the,6,2,7
to,1,3,3


##Jaccard Similarity

In [11]:
#jaccard_similarity_score(wikidef, webmddef, normalize=False)
wikiset = set(wikidef)
webmdset = set(webmddef)
mayoset = set(mayodef)

theintersect = wikiset.intersection(webmdset)
theintersect = theintersect.intersection(mayodef)
theunion = wikiset.union(webmdset)
theunion = theunion.union(mayodef)

theintersect_len = len(list(theintersect))
theunion_len = len(list(theunion))

jaccard = float(theintersect_len) / theunion_len
if jaccard > 0.5:
    print jaccard
    astring = 'Definitions Match.\n'
    astring = astring + 'Wikipedia Definition (as confirmed by WebMD and MayoClinic)\n'
    astring = astring + wikidef
    print astring
print type(jaccard)

0.553191489362
Definitions Match.
Wikipedia Definition (as confirmed by WebMD and MayoClinic)
Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs when blood flow stops to a part of the heart causing damage to the heart muscle. The most common symptom is chest pain or discomfort which may travel into the shoulder, arm, back, neck, or jaw. Often it is in the center or left side of the chest and lasts for more than a few minutes. The discomfort may occasionally feel like heartburn. Other symptoms may include shortness of breath, nausea, feeling faint, a cold sweat, or feeling tired.[1] About 30% of people have atypical symptoms,[2] with women more likely than men to present atypically.[3] Among those over 75 years old, about 5% have had an MI with little or no history of symptoms.[4] An MI may cause heart failure, an irregular heartbeat, or cardiac arrest.[5][6]
<type 'float'>


In [85]:
theunion

{u' ',
 u'"',
 u'%',
 u'(',
 u')',
 u',',
 u'-',
 u'.',
 u'0',
 u'1',
 u'2',
 u'3',
 u'4',
 u'5',
 u'6',
 u'7',
 u'A',
 u'I',
 u'M',
 u'O',
 u'T',
 u'[',
 u']',
 u'a',
 u'b',
 u'c',
 u'd',
 u'e',
 u'f',
 u'g',
 u'h',
 u'i',
 u'j',
 u'k',
 u'l',
 u'm',
 u'n',
 u'o',
 u'p',
 u'q',
 u'r',
 u's',
 u't',
 u'u',
 u'v',
 u'w',
 u'y'}

##Cosine Similarity

In [42]:
#spatial.distance.cosine(wikidef, webmddef)
import math
import numpy as np
def get_cosine(text1, text2):
  vec1 = text1
  vec2 = text2
  intersection = set(vec1) & set(vec2)
  numerator = sum([vec1[x] * vec2[x] for x in intersection])
  sum1 = sum([vec1[x]**2 for x in vec1])
  sum2 = sum([vec2[x]**2 for x in vec2])
  denominator = math.sqrt(sum1) * math.sqrt(sum2)
  if not denominator:
     return 0.0
  else:
     return round(float(numerator) / denominator, 3)
#get_cosine(wikimodel['heart'], webmdmodel['heart'])
npa1 = np.array('poo', 'fart')
npa2 = np.array('poo', 'fart')
get_cosine(npa1, npa2)

TypeError: data type "fart" not understood

##Word to Vector

In [25]:
stoplist = set('for a of the and to in an or'.split())

#Wikipedia
texts = [[word for word in wikidef.lower().split() if word not in stoplist]]

#WebMD
texts2 = [[word for word in webmddef.lower().split() if word not in stoplist]]

#MayoClinic
texts3 = [[word for word in mayodef.lower().split() if word not in stoplist]]
print texts
print texts2

[[u'myocardial', u'infarction', u'(mi)', u'acute', u'myocardial', u'infarction', u'(ami),', u'commonly', u'known', u'as', u'heart', u'attack,', u'occurs', u'when', u'blood', u'flow', u'stops', u'part', u'heart', u'causing', u'damage', u'heart', u'muscle.', u'most', u'common', u'symptom', u'is', u'chest', u'pain', u'discomfort', u'which', u'may', u'travel', u'into', u'shoulder,', u'arm,', u'back,', u'neck,', u'jaw.', u'often', u'it', u'is', u'center', u'left', u'side', u'chest', u'lasts', u'more', u'than', u'few', u'minutes.', u'discomfort', u'may', u'occasionally', u'feel', u'like', u'heartburn.', u'other', u'symptoms', u'may', u'include', u'shortness', u'breath,', u'nausea,', u'feeling', u'faint,', u'cold', u'sweat,', u'feeling', u'tired.[1]', u'about', u'30%', u'people', u'have', u'atypical', u'symptoms,[2]', u'with', u'women', u'more', u'likely', u'than', u'men', u'present', u'atypically.[3]', u'among', u'those', u'over', u'75', u'years', u'old,', u'about', u'5%', u'have', u'had', u

In [37]:
%matplotlib inline
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

wikimodel = gensim.models.Word2Vec(texts, size=100, window=5, min_count=1, workers=4, sg=1)
webmdmodel = gensim.models.Word2Vec(texts2, size=100, window=5, min_count=1, workers=4, sg=1)
mayomodel = gensim.models.Word2Vec(texts3, size=100, window=5, min_count=1, workers=4, sg=1)

#dictionary = corpora.Dictionary(texts)
#corpus = [dictionary.doc2bow(text) for text in texts]
#print model['blood']
#plt.hist(model['blood'], bins=20)
#model.most_similar(positive=['stops', 'heart'], topn=5)

#print dictionary.token2id

#print corpus
#print model.vocab['over']
print wikimodel['heart']
print webmdmodel['heart']
print type(wikimodel['heart'])
print spatial.distance.cosine(wikimodel['heart'], webmdmodel['heart'])
'''
vectorlist = []
i = 0
while i < len(texts[0]):
    k = 0
    while k < len(texts2[0])
        if texts[0][i] == texts2[0][k]:
            vectorlist.append(wikimodel[texts[0][i]])
model['heart']
model.most_similar('heart')
'''

[  3.05975322e-03   3.06197355e-04   2.38542422e-03   3.92996240e-03
  -2.18411209e-04   4.90557356e-03   1.83018786e-03   3.87180527e-03
   4.86988248e-03  -2.82936613e-03  -1.85863441e-03  -7.49925384e-04
  -4.67440346e-03   2.01879558e-03  -1.85370713e-03   1.74093805e-03
  -1.17990885e-05   8.06726108e-04  -3.35370074e-03  -4.54080850e-03
   2.78646289e-03   3.89109882e-05   2.23106486e-04   4.69029974e-03
   1.46635168e-03  -2.70153256e-03   9.46767803e-04   4.68754582e-03
   4.96825390e-03  -4.96543199e-03  -2.80864048e-03   3.27006564e-05
  -2.66863662e-03  -5.15601656e-04   1.09017629e-03   1.15457631e-03
  -4.21691639e-03   4.90501802e-03  -4.79113683e-03  -4.75177774e-03
  -4.32462478e-03  -2.18355213e-03   7.72437430e-04   5.80934633e-04
  -4.87701828e-03  -3.99873266e-03   3.73888062e-03   2.46026157e-03
   3.06714163e-03   1.96975446e-03   8.55190097e-04   1.44930079e-03
   4.30834154e-03   2.97467574e-03   3.06936633e-03  -2.85251695e-03
  -2.07878556e-03  -2.13842350e-03

"\nvectorlist = []\ni = 0\nwhile i < len(texts[0]):\n    k = 0\n    while k < len(texts2[0])\n        if texts[0][i] == texts2[0][k]:\n            vectorlist.append(wikimodel[texts[0][i]])\nmodel['heart']\nmodel.most_similar('heart')\n"

In [None]:
'''
class MySentences(object):
     def __init__(self, dirname):
        self.dirname = dirname
 
     def __iter__(self):
         for fname in os.listdir(self.dirname):
                for line in open(os.path.join(self.dirname, fname)):
                    yield line.split()
'''
#sentences = MySentences('/Users/markregalla/nltk_data/corpora/gutenberg') # a memory-friendly iterator
model = gensim.models.Word2Vec(texts, min_count=3,workers=5)
#model.most_similar(positive=['attack'], negative=['heart'], topn=5)
new_doc = "myocardial"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print new_vec # the word "interaction" does not appear in the dictionary and is ignored

###LSI

In [55]:
doc_bow = [(0, 1), (1, 1)]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi:
    print doc

[]


In [35]:
lsi.print_topics()

[(0,
  u'0.319*"may" + 0.319*"heart" + 0.160*"infarction" + 0.160*"than" + 0.160*"more" + 0.160*"have" + 0.160*"feeling" + 0.160*"myocardial" + 0.160*"with" + 0.160*"chest"')]

In [54]:
vec_bow = dictionary.doc2bow(texts2)
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)
[(0, -0.461821), (1, 0.070028)]

TypeError: coercing to Unicode: need string or buffer, list found

###Use FuzzyWuzzy

In [22]:
print fuzz.token_set_ratio(wikisentences[0], mayosentences[0])

58


In [16]:
wikisentences[0]

u'Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs when blood flow stops to a part of the heart causing damage to the heart muscle.'

In [17]:
mayosentences[0]

u'A heart attack occurs when the flow of blood to the heart is blocked, most often by a build-up of fat, cholesterol and other substances, which form a plaque in the arteries that feed the heart (coronary arteries).'

###Text summarization:
http://glowingpython.blogspot.com/2014/09/text-summarization-with-nltk.html

In [36]:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
    """
     Initilize the text summarizer.
     Words that have a frequency term lower than min_cut 
     or higer than max_cut will be ignored.
    """
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
    """ 
      Compute the frequency of each of word.
      Input: 
       word_sent, a list of sentences already tokenized.
      Output: 
       freq, a dictionary where freq[w] is the frequency of w.
    """
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequencies normalization and fitering
    m = float(max(freq.values()))
    for w in freq.keys():
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
    """
      Return a list of n sentences 
      which represent the summary of text.
    """
    sents = sent_tokenize(text)
    
    #assert that the number of sentences in the paragraph to summarize 
    #is greater than the number of desired summary sentences
    #remove if you run across a paragraph to summarize that is only 1 sentence
    #assert n <= len(sents)        
    
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)

In [3]:
#To test the summarizer, let's create a function that extract the natural language 
#from a html page using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 page = urllib2.urlopen(url).read().decode('utf8')
 soup = BeautifulSoup(page)
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text

In [37]:
#We can finally apply our summarizer on a set of articles extracted from the BBC news feed:
feed_xml = urllib2.urlopen('https://en.wikipedia.org/wiki/Myocardial_infarction').read()

#response = requests.get('https://en.wikipedia.org/wiki/Myocardial_infarction')
#page = response.text
#soup = BeautifulSoup(page, 'html.parser')

#feed = BeautifulSoup(feed_xml.decode('utf8'), 'html.parser')
feed = BeautifulSoup(page, 'html.parser')

to_summarize = map(lambda p: p.text, feed.find_all('p'))
print type(to_summarize)
i = 0
fs = FrequencySummarizer()
for article_url in to_summarize:  #to_summarize is a list of text to be summarized
  #title, text = get_only_text(article_url)
  print '----------------------------------'
  #print title
  print i
  #for s in fs.summarize(text, 2):
  for s in fs.summarize(article_url, 2):  #2nd argument is desired number of summary sentences
      print '*', s                        #if there is only 1 sentence to summarize, assertion error will return
  i += 1

print len(to_summarize)

<type 'list'>
----------------------------------
0
* Myocardial infarction (MI) or acute myocardial infarction (AMI), commonly known as a heart attack, occurs when blood flow stops to a part of the heart causing damage to the heart muscle.
* Other symptoms may include shortness of breath, nausea, feeling faint, a cold sweat, or feeling tired.
----------------------------------
1
* [5] MIs are less commonly caused by coronary artery spasms, which may be due to cocaine, significant emotional stress, and extreme cold, among others.
* [5] Risk factors include high blood pressure, smoking, diabetes, lack of exercise, obesity, high blood cholesterol, poor diet, and excessive alcohol intake, among others.
----------------------------------
2
* [12] In ST elevation MIs treatments which attempt to restore blood flow to the heart are typically recommended and include angioplasty, where the arteries are pushed open, or thrombolysis, where the blockage is removed using medications.
* [2] People wh