# Diagnosis extraction from medical reports
### Kivotova Evgenia, B17-DS-01
#### Innopolis University, 2020

In [36]:
import pandas as pd
from bs4 import BeautifulSoup, NavigableString, Tag
from pathlib import Path
from typing import List

## Dataset paths definition

In [2]:
data_dir = Path('datasets')
openi_reports = data_dir / "ecgen-radiology"

## OpenI dataset analysis

The OpenI reports are stored as `.xml` files with `UTF-8` encoding.

The body of each report is stored in the `<Abstract>` tag of the `<MedlineCitation>` section.

Each report consists of 4 parts:
 - `<AbstractText Label="COMPARISON">` - Comparison of the new imaging exam with any available previous exams.
 - `<AbstractText Label="INDICATION">` - The reason for the examination or important patient information.
 - `<AbstractText Label="FINDINGS">` - Detailed descriptions of normal and abnormal findings.
 - `<AbstractText Label="IMPRESSION">` - The diagnostic conclusion drawn from the Findings, covering both abnormal and normal conclusions.
 
Therefore, the **Impression** section is the most valuable one, as it contains explicit references to the diagnosis.
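
The four-part layout described above can be read with a few lines of code. The sketch below uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, purely to illustrate the structure. The sample document is hypothetical: the FINDINGS and IMPRESSION text is shortened from report `1.xml`, the COMPARISON and INDICATION text is made up, and real reports may nest `<Abstract>` more deeply inside `<MedlineCitation>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature report that follows the layout described above.
# COMPARISON and INDICATION values are invented placeholders.
SAMPLE_REPORT = """\
<MedlineCitation>
  <Abstract>
    <AbstractText Label="COMPARISON">None.</AbstractText>
    <AbstractText Label="INDICATION">Chest pain.</AbstractText>
    <AbstractText Label="FINDINGS">There is no pulmonary edema. There is no focal consolidation.</AbstractText>
    <AbstractText Label="IMPRESSION">Normal chest x-XXXX.</AbstractText>
  </Abstract>
</MedlineCitation>
"""

root = ET.fromstring(SAMPLE_REPORT)

# iter() finds <AbstractText> at any depth, so extra nesting would not matter.
sections = {el.get("Label"): (el.text or "") for el in root.iter("AbstractText")}

for label, text in sections.items():
    print(f"{label}: {text}")
```

Unlike the HTML-oriented parser used later in this notebook, `ElementTree` is a strict XML parser and keeps the original capitalisation of `AbstractText` and `Label`.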

The other interesting section of these reports is the `<MeSH>` tag.

**MeSH** stands for Medical Subject Headings, a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information.

It appears to hold keywords for the whole report, covering both the Impression and the general description of the patient.

**BUT!** A `<major>` tag with the value `normal` does not overlap with any other tag, which is why it is used below to label a report as normal.
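
The claim that `normal` never appears alongside other tags can be checked mechanically. The sketch below is one possible check, not part of the original pipeline: `normal_is_exclusive` is a hypothetical helper, and it assumes MeSH tags are held in a dict of lists such as `{'major': ['normal']}`. The third example is a made-up counterexample that shows what a violation would look like.

```python
from typing import Dict, List


def normal_is_exclusive(mesh: Dict[str, List[str]]) -> bool:
    """True if 'normal' is absent from <major>, or is the only tag in the dict."""
    majors = mesh.get("major", [])
    if "normal" not in majors:
        return True
    # Flatten every tag list (major and any other kind) into one list
    all_tags = [tag for tags in mesh.values() for tag in tags]
    return all_tags == ["normal"]


examples = [
    {"major": ["normal"]},
    {"major": ["Lung/hypoinflation/mild"]},
    {"major": ["normal", "Lung/hypoinflation/mild"]},  # made-up counterexample
]

results = [normal_is_exclusive(m) for m in examples]
print(results)  # [True, True, False]
```

Applied to every parsed report, a single `False` would mean the observation does not hold for the whole dataset.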

## Reports parsing using BeautifulSoup

In this section the reports are parsed into a pandas DataFrame, extracting the important fields and discarding unnecessary information:

In [38]:
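def parse_report(path: Path) -> BeautifulSoup:
    """Parse a single XML report; returns None if the file does not exist."""

    # Check if report path exists
    if not path.exists():
        return None

    # Reports are UTF-8 encoded, so do not rely on the platform default encoding
    with open(path, 'r', encoding='utf-8') as report_file:
        raw_report = report_file.read()

    # Note: "lxml" is an HTML parser here, so tag and attribute names are lowercased
    return BeautifulSoup(raw_report, "lxml")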

In [62]:
def get_abstract_fields(report:BeautifulSoup, labels:List[str]) -> dict:
    result = {}
    for element in report.find_all('abstracttext'):
        label = element.get('label')
        if label in labels:
            result[label] = element.text
    return result
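
`get_abstract_fields` looks up `abstracttext` and `label` in lowercase even though the XML spells them `AbstractText` and `Label`. This is because HTML-oriented parsers fold tag and attribute names to lowercase. The sketch below demonstrates the effect on a hypothetical one-element snippet based on report `1.xml`. It uses Python's built-in `html.parser` so that it runs without `lxml`; the `lxml` parser used in this notebook is expected to fold names the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same mixed-case names as the real reports
snippet = (
    '<Abstract>'
    '<AbstractText Label="IMPRESSION">Normal chest x-XXXX.</AbstractText>'
    '</Abstract>'
)
soup = BeautifulSoup(snippet, "html.parser")

element = soup.find("abstracttext")
print(element.name)                # name as stored by the parser
print(element.get("label"))        # attribute values keep their case
print(soup.find("AbstractText"))   # searching by the mixed-case name finds nothing
```

Parsing with an XML parser instead would preserve the original capitalisation, in which case the lookups would have to use `AbstractText` and `Label`.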

In [48]:
def get_mech_tags(report: BeautifulSoup) -> dict:
    """Collect the children of the <MeSH> section as {tag name: [texts]}."""
    result = {}
    mesh = report.find('mesh')
    # Guard against reports without a <MeSH> section
    if mesh is None:
        return result
    for element in mesh.children:
        # Skip whitespace and other bare strings between tags
        if not isinstance(element, Tag):
            continue
        result.setdefault(element.name, []).append(element.text)
    return result

In [49]:
def get_mesh_label(mesh:dict) -> int:
    if 'major' in mesh and 'normal' in mesh['major']:
        return 0
    else:
        return 1

In [59]:
def parse_dataset(path: Path) -> pd.DataFrame:
    """Parse every report under `path` into one DataFrame; returns None if `path` is missing."""

    # Check if dataset path exists
    if not path.exists():
        return None

    columns = ['FINDINGS', 'IMPRESSION', 'MeSH', 'LABEL']
    rows = []

    # Walk the whole directory tree
    for file in path.glob('**/*'):

        # Process only files
        if not file.is_file():
            continue

        report = parse_report(file)

        # Parse Abstract section, defaulting to '' when a part is missing
        abstract = get_abstract_fields(report, ['FINDINGS', 'IMPRESSION'])
        findings = abstract.get('FINDINGS', '')
        impression = abstract.get('IMPRESSION', '')

        # Parse mesh
        mesh = get_mech_tags(report)
        mesh_label = get_mesh_label(mesh)

        rows.append([findings, impression, mesh, mesh_label])

    # Build the frame once instead of concatenating one-row frames
    return pd.DataFrame(rows, columns=columns)

In [51]:
get_mech_tags(parse_report(openi_reports/'1.xml'))

{'major': ['normal']}

In [64]:
get_abstract_fields(parse_report(openi_reports/'1.xml'), ['FINDINGS', 'IMPRESSION'])

{'FINDINGS': 'The cardiac silhouette and mediastinum size are within normal limits. There is no pulmonary edema. There is no focal consolidation. There are no XXXX of a pleural effusion. There is no evidence of pneumothorax.',
 'IMPRESSION': 'Normal chest x-XXXX.'}

In [63]:
data = parse_dataset(openi_reports)

In [66]:
data.head()

|   | FINDINGS | IMPRESSION | MeSH | LABEL |
|---|----------|------------|------|-------|
| 0 | The heart and mediastinum are unremarkable. Th... | 1. No acute cardiopulmonary disease. 2. Acute ... | {'major': ['Calcified Granuloma/lung/upper lob... | 1 |
| 1 | The heart, pulmonary XXXX and mediastinum are ... | No acute cardiopulmonary disease. | {'major': ['Calcified Granuloma/multiple', 'Ca... | 1 |
| 2 | The heart is top normal in size. The mediastin... | No acute disease. | {'major': ['Atherosclerosis/aorta', 'Aorta/tor... | 1 |
| 3 | Heart size within normal limits. Trachea is mi... | No pulmonary nodules. Negative chest. | {'major': ['Lung/hypoinflation/mild']} | 1 |
| 4 | The cardiomediastinal silhouette and pulmonary... | 1. Patchy left lower lobe airspace disease, co... | {'major': ['Opacity/lung/lower lobe/left/poste... | 1 |
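
All five previewed rows carry `LABEL` 1, so a natural next step is to check how the two labels are distributed. The sketch below shows one way to do that on a small hand-built frame named `preview` (a hypothetical stand-in for `data`). It contains only impressions that appear in full in the outputs above, with report `1.xml` assigned label 0 because its MeSH tags are `{'major': ['normal']}`.

```python
import pandas as pd

# Hand-built stand-in for `data`, limited to values printed in full above
preview = pd.DataFrame(
    {
        "IMPRESSION": [
            "Normal chest x-XXXX.",
            "No acute cardiopulmonary disease.",
            "No acute disease.",
            "No pulmonary nodules. Negative chest.",
        ],
        "LABEL": [0, 1, 1, 1],
    }
)

# Count how many reports fall under each label
label_counts = preview["LABEL"].value_counts().to_dict()
print(label_counts)
```

On the full dataset the same call would be `data['LABEL'].value_counts()`. The sample also shows that an impression such as "No acute disease." can still receive label 1, since the label is derived from the MeSH tags and not from the impression text.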
