# Classifying Documents

In this notebook we demonstrate a basic document-level classification of reports with respect to a single finding (pulmonary embolism). We use Pandas to read our data from a SQLite database and to add our classification as a new column in the dataframe.

Many of the common pyConTextNLP tasks have been wrapped into functions contained in the [``radnlp``](https://github.com/chapmanbe/RadNLP) package. We import multiple modules that will allow us to write concise code.

In [2]:
import radnlp

In [5]:
import pyConTextNLP.pyConTextGraph as pyConText
import pyConTextNLP.itemData as itemData
import os
import radnlp.io  as rio
import radnlp.view as rview
import radnlp.rules as rules
import radnlp.schema as schema
import radnlp.utils as utils
import radnlp.split as split
import radnlp.classifier as classifier
import sqlite3 as sq
import pandas as pd
import numpy as np
from IPython.display import clear_output, display, HTML, Image
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
import seaborn as sns
import matplotlib.pyplot as plt
from radnlp.data import classrslts 
import networkx as nx

In [6]:
colors={"pulmonary_embolism":"blue",
       "definite_negated_existence":"red",
       "probable_negated_existence":"indianred",
       "ambivalent_existence":"orange",
       "probable_existence":"forestgreen",
       "definite_existence":"green",
       "historical":"goldenrod",
       "indication":"pink",
       "acute":"gold"}

In [7]:
import radnlp
radnlp.__version__

'0.2.0.8'

### Explanation of ``getOptions``

``getOptions`` is a port of a command-line application, where I would use ``argparse`` to collect the options. Here the options are simply hard-coded in a dictionary.


In [17]:
def getOptions():
    """Generates arguments for specifying database and other parameters"""
    options = {}
    options['infile'] = os.path.join("data", "pitt_reports.sqlite")
    options['lexical_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.tsv"]
    options['domain_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/pe_kb.tsv"]
    options["schema"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/schema2.csv"
    options["rules"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/classificationRules3.csv" 
    return options
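
For comparison, a command-line version of the same options might look roughly like the sketch below. The flag names and the ``build_parser`` helper are assumptions made for illustration; only the default values come from ``getOptions``.

```python
import argparse
import os

KB_ROOT = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/"


def build_parser():
    """Hypothetical argparse equivalent of getOptions (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Classify radiology reports")
    parser.add_argument("--infile",
                        default=os.path.join("data", "pitt_reports.sqlite"))
    # nargs="+" keeps the knowledge bases as lists, as in getOptions.
    parser.add_argument("--lexical-kb", dest="lexical_kb", nargs="+",
                        default=[KB_ROOT + "lexical_kb_05042016.tsv"])
    parser.add_argument("--domain-kb", dest="domain_kb", nargs="+",
                        default=[KB_ROOT + "pe_kb.tsv"])
    parser.add_argument("--schema", default=KB_ROOT + "schema2.csv")
    parser.add_argument("--rules", default=KB_ROOT + "classificationRules3.csv")
    return parser


# Parsing an empty argument list yields the defaults; vars() converts the
# Namespace into a plain dict with the same keys that getOptions returns.
options = vars(build_parser().parse_args([]))
print(sorted(options))
```

Because the result is an ordinary dictionary, it could be passed to ``get_kb_rules_schema`` in place of the hard-coded options.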

#### Explanation of ``get_kb_rules_schema``

* ``itemData.instantiateFromCSVtoitemData``: The name of this function is somewhat unfortunate. It implies reading CSV files, but we have moved to tab-delimited files because the regular expressions sometimes contain commas.
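
The problem with commas can be seen in a small standalone sketch. The row below is hypothetical (its column layout is an assumption, not the actual knowledge-base format); the point is only that a quantifier such as ``{1,3}`` in a regular expression contains a comma.

```python
import csv
import io

# Hypothetical knowledge-base row; the column layout is assumed for illustration.
# The regular expression in the third field contains a comma in {1,3}.
row = ["no evidence of", "DEFINITE_NEGATED_EXISTENCE",
       r"no\s{1,3}evidence of", "forward"]

tab_line = "\t".join(row)
comma_line = ",".join(row)  # naive, unquoted comma-separated line

tab_fields = next(csv.reader(io.StringIO(tab_line), delimiter="\t"))
comma_fields = next(csv.reader(io.StringIO(comma_line)))

print(len(tab_fields))    # 4: the regular expression survives intact
print(len(comma_fields))  # 5: the regular expression was split at its comma
```

Quoting the field would also protect the comma in a CSV file, but a tab delimiter avoids the issue without requiring quoted fields.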


In [19]:
def get_kb_rules_schema(options):
    """
    Get the relevant kb, rules, and schema.
    
    """
    _radnlp_rules = rules.read_rules(options["rules"])
    _schema = schema.read_schema(options["schema"])
    
    modifiers = itemData.itemData()
    targets = itemData.itemData()
    for kb in options['lexical_kb']:
        modifiers.extend( itemData.instantiateFromCSVtoitemData(kb) )
    for kb in options['domain_kb']:
        targets.extend( itemData.instantiateFromCSVtoitemData(kb) )
    return {"rules":_radnlp_rules,
            "schema":_schema,
            "modifiers":modifiers,
            "targets":targets}
    

In [10]:
def analyze_report(report, modifiers, targets, rules, schema):
    """
    Given an individual radiology report, mark it up with pyConTextNLP,
    classify the marked-up targets, and return both in a classrslts object.
    report: a text string containing the radiology report
    """
    markup = utils.mark_report(split.get_sentences(report),
                         modifiers,
                         targets)
    
    clssfy =   classifier.classify_document_targets(markup,
                                          rules[0],
                                          rules[1],
                                          rules[2],
                                          schema)
    return classrslts(context_document=markup, exam_type="ctpa", report_text=report, classification_result=clssfy)

#### Alternatively, do each step separately

In [20]:

def mark_report(report, modifiers, targets):
    """
    given an individual radiology report, creates a pyConTextGraph
    object that contains the context markup
    report: a text string containing the radiology reports
    """
    
    markup = utils.mark_report(split.get_sentences(report),
                         modifiers,
                         targets)
    return markup


def classify_report(markup, rules, schema):
    """Classify the targets in a marked-up document using the rules and schema."""
    return  classifier.classify_document_targets(markup,
                                          rules[0],
                                          rules[1],
                                          rules[2],
                                          schema)

In [21]:
def get_data():
    options = getOptions()
    kb = get_kb_rules_schema(options)
    conn = sq.connect(options['infile'])
    data = pd.read_sql("""SELECT * FROM reports""", conn)
    return data, kb



In [22]:
data, kb = get_data()
#data = data.dropna()

In [23]:
from collections import defaultdict

utahData = defaultdict(lambda: defaultdict(list))

# For each annotated attribute, collect the impression text and its label,
# skipping rows where the label is missing (None or the string 'NULL').
for index, row in data.iterrows():
    for column in ('disease_state', 'uncertainty', 'quality', 'historicity'):
        if row[column] is not None and row[column] != 'NULL':
            utahData[column]['text'].append(row['impression'])
            utahData[column]['label'].append(row[column])

### Document Classification

We now need to apply our schema to the reports. Since our data is in a Pandas data frame, the easiest way to process our reports is with the DataFrame [``apply``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method.

* We use ``lambda`` to create an anonymous function that applies ``analyze_report`` to the ``"impression"`` column of each row, passing the modifiers, targets, rules, and schema that we have read in separately.
* ``analyze_report`` returns a ``classrslts`` object. Its ``classification_result`` attribute is a dictionary whose keys are the identified targets (as defined in the ``"targets"`` file) and whose values are tuples containing:
    * The schema value that was selected for the document
    * The node (evidence) that was used for selecting that schema value
    * A list (empty in the outputs shown in this notebook)
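
The row-wise ``apply`` pattern can be tried on a toy dataframe first. In this sketch ``toy_analyze`` is a made-up stand-in for ``analyze_report`` (it only checks for a keyword), so no knowledge bases are needed.

```python
import pandas as pd


def toy_analyze(text, keyword):
    """Stand-in for analyze_report: says whether `keyword` occurs in `text`."""
    return "mentioned" if keyword in text.lower() else "not mentioned"


reports = pd.DataFrame({"impression": ["No evidence of pulmonary embolism.",
                                       "Clear lungs."]})

# axis=1 hands each row to the lambda; the extra argument ("embolism")
# is fixed inside the lambda, just as kb["modifiers"] etc. are fixed
# when analyze_report is applied.
reports["rslt"] = reports.apply(
    lambda row: toy_analyze(row["impression"], "embolism"), axis=1)

print(reports["rslt"].tolist())
```

Note that the stand-in only detects a mention; unlike pyConTextNLP it knows nothing about negation.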
    

In [19]:
data = pd.read_csv("/Users/YY/Dropbox/Yuyan PE project/ipython_notebooks/outputMerged.csv")

In [50]:
import csv
inFile1 = "/Users/YY/Dropbox/Yuyan PE project/ipython_notebooks/outputMerged.csv"
files = []
with open(inFile1) as tsvfile:
    next(tsvfile)
    tsvreader = csv.reader(tsvfile)
    for line in tsvreader:
        files.append(line)

In [20]:
data.loc[[4646]]

Unnamed: 0,pat_deid,order_deid,days_age_at_ct,rad_report,impression,batch,disease_state_label,uncertainty_label,quality_label,historicity_label,disease_state_prob,uncertainty_prob,quality_prob,historicity_prob
4646,ML_PE55190,410546583,4042,This exam contains no SHC radiology report. Pl...,MISSING,,,,,,0.101184,0.03013,0.9936,0.9994


In [15]:
data = data.dropna(subset=['impression'])

In [59]:
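# Append a binary flag to each row: 1 if the document-level schema value for
# pulmonary embolism equals 8, otherwise 0 (including reports with no result).
for index, f in enumerate(files):
    try:
        f.append(int(dataOut['pe rslt'][index].classification_result['pulmonary_embolism'][0] == 8))
    except (KeyError, IndexError, AttributeError):
        f.append(0)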

In [58]:
dataOut["pe rslt"][index].classification_result

{}

In [62]:
fa = []
with open(inFile1) as tsvfile:
    tsvreader = csv.reader(tsvfile)
    for line in tsvreader:
        fa.append(line)

In [64]:
fa[0].append("disease_PEfinder")

In [69]:
fa[0]

['pat_deid',
 'order_deid',
 'days_age_at_ct',
 'rad_report',
 'impression',
 'batch',
 'disease_state_label',
 'uncertainty_label',
 'quality_label',
 'historicity_label',
 'disease_state_prob',
 'uncertainty_prob',
 'quality_prob',
 'historicity_prob',
 'disease_PEfinder']

In [71]:
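import csv
# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("outputFinal.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fa[0])   # header row, including the new disease_PEfinder column
    writer.writerows(files)  # data rows with the appended flag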

In [28]:
data["pe rslt"] = \
data.apply(lambda x: analyze_report(x["impression"], 
                                     kb["modifiers"], 
                                     kb["targets"],
                                     kb["rules"],
                                     kb["schema"]), axis=1)

In [25]:
data['pe rslt']

0         (__________________________________________\n,...
1         (__________________________________________\n,...
2         (__________________________________________\n,...
3         (__________________________________________\n,...
4         (__________________________________________\n,...
5         (__________________________________________\n,...
6         (__________________________________________\n,...
7         (__________________________________________\n,...
8         (__________________________________________\n,...
9         (__________________________________________\n,...
10        (__________________________________________\n,...
11        (__________________________________________\n,...
12        (__________________________________________\n,...
13        (__________________________________________\n,...
14        (__________________________________________\n,...
15        (__________________________________________\n,...
16        (_____________________________

In [23]:
for i in range(len(data)):
    dataIn = data.loc[[i]]
    dataIn["pe rslt"] = \
    dataIn.apply(lambda x: analyze_report(x["impression"], 
                                         kb["modifiers"], 
                                         kb["targets"],
                                         kb["rules"],
                                         kb["schema"]), axis=1)

TypeError: ("The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>", 'occurred at index 4646')
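
A ``TypeError`` like this one is consistent with a missing impression: Pandas represents a missing value as ``NaN``, which is a float rather than a string. Instead of looping over the rows one at a time, the offending rows can be located directly. The following is a minimal sketch on a toy dataframe, not the project data.

```python
import pandas as pd

toy = pd.DataFrame({"impression": ["NO EVIDENCE OF PULMONARY EMBOLISM.",
                                   float("nan"),
                                   "CLEAR LUNGS."]})

# True where the cell holds an actual string; NaN is a float, so it is False.
is_text = toy["impression"].apply(lambda value: isinstance(value, str))

# Index labels of the rows that would make a text-only function fail.
bad_rows = toy.index[~is_text].tolist()
print(bad_rows)
```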

In [26]:
data.loc[[4646]]

Unnamed: 0,pat_deid,order_deid,days_age_at_ct,rad_report,impression,batch,disease_state_label,uncertainty_label,quality_label,historicity_label,disease_state_prob,uncertainty_prob,quality_prob,historicity_prob
4646,ML_PE55190,410546583,4042,This exam contains no SHC radiology report. Pl...,,,,,,,0.101184,0.03013,0.9936,0.9994


In [57]:
data = data[np.isfinite(data['impression'])]

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [66]:
data = data.dropna(subset=['impression'])

In [67]:
len(data)

117482

In [64]:
df = df[pd.notnull(df['impression'])]

In [54]:
data.dropna(axis="impression")

ValueError: No axis named impression for object type <class 'pandas.core.frame.DataFrame'>
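
The two failed attempts have simple explanations. ``np.isfinite`` is defined for numeric data, not for an object column of strings, and the ``axis`` argument of ``dropna`` selects rows or columns (``0``/``"index"`` or ``1``/``"columns"``) rather than naming a column. The column to inspect goes in ``subset``. A small sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"impression": ["CLEAR LUNGS.", np.nan],
                    "batch": [np.nan, np.nan]})

# Only the "impression" column is inspected; NaN in "batch" is ignored.
cleaned = toy.dropna(subset=["impression"])

# Equivalent boolean-mask form.
same = toy[pd.notnull(toy["impression"])]

print(len(toy), len(cleaned), cleaned.equals(same))
```

Calling ``dropna()`` with no arguments would instead drop every row here, because each row has a missing value in ``batch``.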

In [52]:
data.loc[[4646]]

Unnamed: 0,pat_deid,order_deid,days_age_at_ct,rad_report,impression,batch,disease_state_label,uncertainty_label,quality_label,historicity_label,disease_state_prob,uncertainty_prob,quality_prob,historicity_prob,pe rslt
4646,ML_PE55190,410546583,4042,This exam contains no SHC radiology report. Pl...,,,,,,,0.101184,0.03013,0.9936,0.9994,


In [62]:
df

Unnamed: 0,pat_deid,order_deid,days_age_at_ct,rad_report,impression,batch,disease_state_label,uncertainty_label,quality_label,historicity_label,disease_state_prob,uncertainty_prob,quality_prob,historicity_prob,pe rslt
0,ML_PE55213,789087,2772,"CT CHEST, ABDOMEN AND PELVIS WITH CONTRAST: 0...","-***-1. SMALL, NONSPECIFIC HYPOATTENUATING L...",,,,,,3.507866e-04,1.233549e-03,0.8628,0.9124,
1,ML_PE54737,249891,822,CT ANGIO OF THE CHEST WITH INTRAVENOUS CONTRAS...,-***-1. CT ANGIOGRAM OF THE CHEST WITHIN THE ...,,,,,,2.407167e-02,3.893098e-06,0.7488,0.8904,
2,ML_PE54001,590371,6928,CT OF THE CHEST WITH CONTRAST: 12/05/2011.-*...,-***-1. THERE IS NO EVIDENCE OF INTRATHORACI...,,,,,,2.910765e-06,5.524144e-02,0.8858,0.9042,
3,ML_PE54363,646632,5992,CT OF THE CHEST WITH CONTRAST: 03/21/2012 -**...,-***-1. OVERALL MARKED DECREASE IN SIZE AND N...,,,,,,4.390231e-02,1.590347e-04,0.9434,0.9156,
4,ML_PE54565,649996,6299,CT OF THE CHEST WITH CONTRAST-***-FINDINGS: L...,-***-1. TINY NONSPECIFIC AREA OF VAGUE GROUND ...,,,,,,2.350871e-05,3.753931e-03,0.8602,0.8998,
5,ML_PE54834,697295,957,CT OF THE CHEST-***-There is no cardiomegaly o...,-***-THERE ARE TWO NONSPECIFIC PULMONARY NODUL...,,,,,,1.268217e-03,2.891451e-03,0.8258,0.9562,
6,ML_PE43130,728626,3792,CT OF THE CHEST WITH CONTRAST: 05/14/2012 -**...,-***-1. INTERVAL RESOLUTION OF THE MEDIASTINA...,,,,,,1.616828e-06,1.077680e-02,0.9332,0.8640,
7,ML_PE55072,731908,5289,CT SCAN OF THE CHEST WITH CONTRAST-***-FINDING...,-***-STATUS POST RESECTION OF A MEDIASTINAL MA...,,,,,,6.691039e-08,8.684614e-02,0.9240,0.8774,
8,ML_PE56832,732770,4850,CT ANGIOGRAM CHEST WITH AND WITHOUT CONTRAST: ...,-***-1. NATIVE ANATOMY COMPATIBLE WITH TRICUS...,,,,,,7.022522e-05,1.068863e-03,0.8224,0.9496,
9,ML_PE53569,763749,2387,CT ANGIOGRAM OF THE CHEST WITH AND WITHOUT CON...,-***-1. MILD RIGHT VENTRICULAR HYPERTROPHY WI...,,,,,,1.374909e-04,1.142393e-03,0.8788,0.8656,


In [38]:
data["pe rslt"] = \
    dataIn.apply(lambda x: analyze_report(x["impression"], 
                                         kb["modifiers"], 
                                         kb["targets"],
                                         kb["rules"],
                                         kb["schema"]), axis=1)

In [36]:
data[1000:1001]

Unnamed: 0,pat_deid,order_deid,days_age_at_ct,rad_report,impression,batch,disease_state_label,uncertainty_label,quality_label,historicity_label,disease_state_prob,uncertainty_prob,quality_prob,historicity_prob,pe rslt
1000,ML_PE54992,834725,209,CT ANGIOGRAM OF THE CHEST WITH AND WITHOUT CON...,"-***-1. NO EVIDENCE OF PULMONARY EMBOLISM, AS...",,,,,,0.003841,0.000418,0.832,0.9382,"(__________________________________________\n,..."


In [22]:
data['pe rslt'][0].classification_result

{'pulmonary_embolism': (2,
  "\n<tagObject>\n<id> 236408518176355198228383601059127484118 </id>\n<phrase> PULMONARY EMBOLISM </phrase>\n<literal> pulmonary embolism </literal>\n<category> ['pulmonary_embolism'] </category>\n<spanStart> 3 </spanStart>\n<spanStop> 21 </spanStop>\n<scopeStart> 0 </scopeStart>\n<scopeStop> 22 </scopeStop>\n</tagObject>\n",
  [])}

In [41]:
data["pe rslt"][1].classification_result

{'pulmonary_embolism': (8,
  "\n<tagObject>\n<id> 236419490326125473666791127421117652694 </id>\n<phrase> PULMONARY EMBOLISM </phrase>\n<literal> pulmonary embolism </literal>\n<category> ['pulmonary_embolism'] </category>\n<spanStart> 34 </spanStart>\n<spanStop> 52 </spanStop>\n<scopeStart> 0 </scopeStart>\n<scopeStop> 166 </scopeStop>\n</tagObject>\n",
  [])}

In [42]:
data["pe rslt"][2].classification_result

{'pulmonary_embolism': (8,
  "\n<tagObject>\n<id> 236426696285962471037922250789613592278 </id>\n<phrase> PULMONARY EMBOLISM </phrase>\n<literal> pulmonary embolism </literal>\n<category> ['pulmonary_embolism'] </category>\n<spanStart> 78 </spanStart>\n<spanStop> 96 </spanStop>\n<scopeStart> 0 </scopeStart>\n<scopeStop> 97 </scopeStop>\n</tagObject>\n",
  [])}
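
Since each value in ``classification_result`` is a tuple and the dictionary can be empty when no target was found, a small accessor keeps downstream code tidy. This is a sketch that assumes the structure shown in the outputs above; ``schema_value`` is a hypothetical helper, not part of ``radnlp``, and the evidence string is abbreviated.

```python
# Same shape as the outputs above, with the evidence string abbreviated.
classification_result = {
    "pulmonary_embolism": (8, "<tagObject>...</tagObject>", []),
}


def schema_value(result, target="pulmonary_embolism", default=None):
    """Return the schema value for `target`, or `default` if it is absent."""
    entry = result.get(target)
    return default if entry is None else entry[0]


print(schema_value(classification_result))  # 8
print(schema_value({}))                      # None
```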

In [32]:
data

Unnamed: 0,id,impression,disease_state,uncertainty,quality,historicity,pe rslt
0,70,\n[Report de-identified (Limited dataset compl...,Neg,No,Diagnostic,,"(__________________________________________\n,..."
1,71,\n[Report de-identified (Limited dataset compl...,Pos,No,Diagnostic,New,"(__________________________________________\n,..."
2,72,\n[Report de-identified (Limited dataset compl...,Pos,Yes,Diagnostic,New,"(__________________________________________\n,..."
3,73,\n[Report de-identified (Limited dataset compl...,Neg,No,Diagnostic,,"(__________________________________________\n,..."
4,74,\n[Report de-identified (Limited dataset compl...,Neg,Yes,Not Diagnostic,,"(__________________________________________\n,..."
5,75,\n[Report de-identified (Limited dataset compl...,Pos,No,Diagnostic,New,"(__________________________________________\n,..."
6,76,\n[Report de-identified (Limited dataset compl...,Pos,No,Diagnostic,New,"(__________________________________________\n,..."
7,77,\n[Report de-identified (Limited dataset compl...,Neg,Yes,Diagnostic,,"(__________________________________________\n,..."
8,78,\n[Report de-identified (Limited dataset compl...,Neg,No,Diagnostic,,"(__________________________________________\n,..."
9,79,\n[Report de-identified (Limited dataset compl...,Pos,Yes,Not Diagnostic,New,"(__________________________________________\n,..."


In [61]:
def view_markup(reports, colors):
    @interact(i=widgets.IntSlider(min=0, max=len(reports)-1))
    def _view_markup(i):
        markup = reports["pe rslt"][i]
        rview.markup_to_pydot(markup)
        display(Image("tmp.png"))
        mt = rview.markup_to_html(markup, color_map=colors)

        display(HTML(mt))

In [62]:
view_markup(data, colors)

KeyError: 0
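
One plausible cause of this ``KeyError`` (an assumption, since the traceback is not shown) is that ``reports["pe rslt"][i]`` looks rows up by index label, while the slider produces positions from ``0`` to ``len(reports)-1``. After rows have been dropped, some labels no longer exist. The toy example below reproduces the effect and shows positional lookup with ``.iloc``.

```python
import pandas as pd

toy = pd.DataFrame({"impression": [None, "CLEAR LUNGS.", "NO PE."]})
toy = toy.dropna(subset=["impression"])  # removes the row labelled 0

try:
    toy["impression"][0]  # label-based lookup: label 0 no longer exists
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Positional lookup is unaffected by gaps in the index labels.
first_by_position = toy["impression"].iloc[0]
print(lookup_failed, first_by_position)
```

If this is indeed the cause, using ``reports["pe rslt"].iloc[i]`` inside ``_view_markup``, or calling ``reset_index(drop=True)`` after dropping rows, should avoid the error.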

## Known Issues

### pyConTextNLP does not deal with coreference
#### Example: Report 776
* Since we deal with each sentence independently, we miss the historical modifiers in the subsequent sentences
* graphviz does not seem to draw directional graphs with single nodes
* The ``radnlp`` package is very rough and I'll be working on it extensively.
