### Overview

This notebook explores the de-identified sample of electronic case reports (eCRs) provided by Los Angeles County (LAC), with the following objectives:

1. Determine where eCRs tend to look identical, irrespective of patient, provider, or EHR system  
2. Determine where eCRs are most subject to change  
3. Determine if there is a subset of variables that can be used to identify duplicate eCRs with any level of confidence  
4. Determine how eCRs vary across different EHR systems
5. Determine if there is a subset of variables that can be used to minimally define a "document" and a "patient"

### Hypotheses

*Problem Hypothesis:*

- Based on user stories gathered during DIBBs user interviews, we believe epidemiologists engage in a time-consuming, manual process of comparing eCRs to one another.
- We believe STLTs experience non-trivial performance declines in their data ingestion and/or case surveillance software, regardless of which system they use. This is based on information the VIPER team received from Idaho (which uses NBS) and on the partnership interest we've seen from Chicago and Dallas (both of which use Salesforce).

*Solution Hypothesis:*

- By identifying the sections of an eCR that contain pertinent information for the purposes of case investigations and the criteria by which a section of data can be determined to be duplicative or redundant, we believe epidemiologists will spend less time manually comparing eCRs, allowing them to spend more time doing the things they do best.

### Environment Setup
The following code imports the libraries needed for this analysis and unzips the data if that has not already been done. It then collects the list of data directories, each named after the zip file it was extracted from, for later iteration over every eCR. Lastly, it defines an `eCR` class for performing rudimentary queries on the data.

In [None]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from lxml import etree
from pathlib import Path
from xmldiff import main
from zipfile import ZipFile

In [None]:
def extract_zip_files(path):
    for file in os.listdir(path):
        if file.endswith(".zip"):
            new_dir = os.path.join(path, file[:-4])
            if not os.path.exists(new_dir):
                os.makedirs(new_dir)
                with ZipFile(os.path.join(path, file), 'r') as zip_ref:
                    zip_ref.extractall(new_dir)

extract_zip_files("./../data/LAC_DATA")

In [None]:
all_dirs = [d for d in os.listdir("./../data/LAC_DATA") if not d.endswith(".zip")]

In [None]:
class eCR:
    def __init__(self, path):
        self.path = path
        self.tree = etree.parse(path)
        self.root = self.tree.getroot()
        self.ns = self.root.nsmap
        self.ns['default'] = self.ns.pop(None)

class RR(eCR):
    def __init__(self, path):
        super().__init__(path)

class eICR(eCR):
    def __init__(self, path):
        super().__init__(path)
    
    def get_patient_id(self):
        return self.root.find('.//default:patient/default:id', self.ns).attrib['extension']
    
    def get_patient_name(self):
        given = self.root.find('.//default:patient/default:name/default:given', self.ns).text
        family = self.root.find('.//default:patient/default:name/default:family', self.ns).text
        return f"{given} {family}"
    
    def get_patient_dob(self):
        return self.root.find('.//default:patient/default:birthTime', self.ns).attrib['value']
    
    def get_send_date(self):
        time = self.root.find('.//default:effectiveTime', self.ns)
        if time is not None:
            return self._format_date_time(time.attrib.get('value'))
        
    def get_encounter_id(self):
        id = self.root.find('.//default:encounter/default:id', self.ns)
        if id is not None:
            return id.attrib.get('extension')

    def get_set_id(self):
        id = self.root.find('.//default:setId', self.ns)
        if id is not None:
            return id.attrib.get('extension')

    def get_ecr_version(self):
        version = self.root.find('.//default:versionNumber', self.ns)
        if version is not None:
            return version.attrib.get('value') or version.text

    def get_sections(self):
        return self.root.findall('.//default:section', self.ns)
    
    def _format_date_time(self, date_time):
        if date_time is not None:
            return datetime.strptime(date_time, '%Y%m%d%H%M%S%z').isoformat()

In [None]:
section_name_counts = {}

for dir in all_dirs:
    # read in the CDA_eICR.xml file and tally how often each section title appears
    file_path = Path("./../data/LAC_DATA/" + dir + "/CDA_eICR.xml")
    ecr = eICR(file_path)
    sections = ecr.get_sections()
    for section in sections:
        section_name = section.find('.//default:title', ecr.ns).text
        if section_name in section_name_counts:
            section_name_counts[section_name] += 1
        else:
            section_name_counts[section_name] = 1

print(f"Found {len(section_name_counts)} unique section names")
for k,v in sorted(section_name_counts.items(), key=lambda x: x[1], reverse=True):
    print(k + ": " + str(v))

In [None]:
section_name_taxonomy = {
    "Allergy Information": [],
    "Patient Information": [],
    "Encounter Information": [],
    "Provider Information": [],
    "Clinical Information": [],
    "Medication Information": [],
    "Immunizations": [],
    "Diagnoses": [],
    "Reason for Visit": [],
    "Care Plan": [],
    "Social History": [],
    "Family History": [],
    "Problems": [],
    "Notes": [],
    "Other": [],
}

for section_name in section_name_counts:
    if "encounter" in section_name.lower():
        section_name_taxonomy["Encounter Information"].append(section_name)
    elif "allergy" in section_name.lower() or "allergies" in section_name.lower():
        section_name_taxonomy["Allergy Information"].append(section_name)
    elif "provider" in section_name.lower():
        section_name_taxonomy["Provider Information"].append(section_name)
    elif "clinical" in section_name.lower():
        section_name_taxonomy["Clinical Information"].append(section_name)
    elif "medication" in section_name.lower():
        section_name_taxonomy["Medication Information"].append(section_name)
    elif "care" in section_name.lower() or "plan" in section_name.lower():
        section_name_taxonomy["Care Plan"].append(section_name)
    elif "social" in section_name.lower():
        section_name_taxonomy["Social History"].append(section_name)
    elif "family" in section_name.lower():
        section_name_taxonomy["Family History"].append(section_name)
    elif "problem" in section_name.lower():
        section_name_taxonomy["Problems"].append(section_name)
    elif "diagnoses" in section_name.lower() or "diagnosis" in section_name.lower():
        section_name_taxonomy["Diagnoses"].append(section_name)
    elif "reason" in section_name.lower():
        section_name_taxonomy["Reason for Visit"].append(section_name)
    elif "immunization" in section_name.lower():
        section_name_taxonomy["Immunizations"].append(section_name)
    elif "note" in section_name.lower():
        section_name_taxonomy["Notes"].append(section_name)
    elif "patient" in section_name.lower():
        section_name_taxonomy["Patient Information"].append(section_name)
    else:
        section_name_taxonomy["Other"].append(section_name)

for k,v in section_name_taxonomy.items():
    print(f"{k}: {len(v)}")
    for section in v:
        print(f"\t{section}")

The above analysis shows that section names are not standardized across eCR documents, which will make parsing difficult. The names are, however, generally categorizable, which provides some options.
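
One option that follows from this: because the titles can be grouped by keyword, the matching rules could live in an ordered list instead of a chain of `elif` branches. The block below is a minimal sketch of that idea, not the approach used in this notebook. `CATEGORY_RULES` and `categorize_section` are hypothetical names, the rule list is abbreviated for illustration, and the title "Results" is made up to show the fallback.

```python
# Ordered (category, keywords) rules; the first rule with a matching keyword wins.
# Abbreviated, hypothetical rule list that follows the order of the if/elif chain.
CATEGORY_RULES = [
    ("Encounter Information", ("encounter",)),
    ("Allergy Information", ("allergy", "allergies")),
    ("Care Plan", ("care", "plan")),
    ("Problems", ("problem",)),
    ("Notes", ("note",)),
    ("Patient Information", ("patient",)),
]


def categorize_section(title):
    """Return the first category whose keywords appear in the title, else 'Other'."""
    lowered = (title or "").lower()
    for category, keywords in CATEGORY_RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


# Titles that appear in cells of this notebook, plus one made-up title ("Results").
for title in ["Encounter Details", "Plan of Treatment", "ENCOUNTERS", "Results"]:
    print(f"{title} -> {categorize_section(title)}")
```

Keeping the rules in data makes the match order explicit and makes it harder to leave an unreachable branch behind when the list is edited.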

In [None]:
# # Preprocessing
# stop = set(stopwords.words('english'))
# exclude = set(string.punctuation)
# lemma = WordNetLemmatizer()

# def clean(doc):
#     stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
#     punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
#     normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
#     return normalized

# section_names_clean = [clean(doc) for doc in section_name_counts]

# # Convert the section names into a matrix of token counts
# vectorizer = CountVectorizer()
# section_names_term_matrix = vectorizer.fit_transform(section_names_clean)

# # Train LDA model
# lda = LatentDirichletAllocation(n_components=10, random_state=0)
# lda.fit(section_names_term_matrix)

# # Classify section names
# section_names_topics = lda.transform(section_names_term_matrix)

In [None]:
# import numpy as np

# most_likely_topics = np.argmax(section_names_topics, axis=1)
# topic_labels = ["Topic" + str(i) for i in range(lda.n_components)]
# section_name_labels = [topic_labels[i] for i in most_likely_topics]
# topics = {}
# for section_name, label in zip(section_name_counts, section_name_labels):
#     if label in topics:
#         topics[label].append(section_name)
#     else:
#         topics[label] = [section_name]

# for k,v in topics.items():
#     print(f"{k}: {len(v)}")
#     for section in v:
#         print(f"\t{section}")

In [None]:
ecrs_by_patient = {}
for dir in all_dirs:
    file_path = Path("./../data/LAC_DATA/" + dir + "/CDA_eICR.xml")
    ecr = eICR(file_path)
    patient_name = ecr.get_patient_name()
    if patient_name in ecrs_by_patient:
        ecrs_by_patient[patient_name].append(file_path)
    else:
        ecrs_by_patient[patient_name] = [file_path]

for name, paths in sorted(ecrs_by_patient.items(), key=lambda x: len(x[1]), reverse=True):
    print(f"{name}: {len(paths)}")

In [None]:
def compare_section_names(doc1, doc2):
    ecr1 = eICR(doc1)
    ecr2 = eICR(doc2)
    sections1 = ecr1.get_sections()
    sections2 = ecr2.get_sections()
    section_names1 = [section.find('.//default:title', ecr1.ns).text for section in sections1]
    section_names2 = [section.find('.//default:title', ecr2.ns).text for section in sections2]

    return {
        "Both": set(section_names1).intersection(set(section_names2)),
        "Doc 1 Only": set(section_names1) - set(section_names2),
        "Doc 2 Only": set(section_names2) - set(section_names1)
    }

compare_section_names(ecrs_by_patient['Peter Pan Epictest'][0], ecrs_by_patient['Peter Pan Epictest'][1])

In [None]:
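def extract_cell_data(cell):
    # Take the namespace from the cell itself rather than the global `ecr`,
    # so the helper works for elements from any parsed document.
    ns = {'default': cell.nsmap[None]}
    content = cell.find('.//default:content', ns)
    if content is not None:
        return content.text
    return cell.text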

def parse_section_content(section):
    # extract table data from the section
    """
    Example data
    <section>
        <templateId root="2.16.840.1.113883.10.20.22.2.10"/>
        <templateId extension="2014-06-09" root="2.16.840.1.113883.10.20.22.2.10"/>
        <code code="18776-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Plan of care note"/>
        <title>
            Plan of Treatment
        </title>
        <text>
            <table>
                <caption>
                    Pending Results
                </caption>
                <colgroup>
                    <col width="25%"/>
                    <col width="15%"/>
                    <col width="10%"/>
                    <col span="2" width="25%"/>
                </colgroup>
                <thead>
                    <tr>
                        <th>Name</th>
                        <th>Type</th>
                        <th>Priority</th>
                        <th>Associated Diagnoses</th>
                        <th>Date/Time</th>
                    </tr>
                </thead>
                <tbody>
                    <tr ID="procedure9">
                        <td>
                            <content ID="procedure9name">
                                Drugs Of Abuse Comprehensive Screen, Ur
                            </content>
                        </td>
                        <td>Lab</td>
                        <td>STAT</td>
                        <td/><td>12/23/2022 11:13 AM PST</td>
                    </tr>
                </tbody>
            </table>
            <table>
                <caption>Scheduled Orders</caption>
                <colgroup>
                    <col width="25%"/>
                    <col width="15%"/>
                    <col width="10%"/>
                    <col span="2" width="25%"/>
                </colgroup>
                <thead>
                    <tr>
                        <th>Name</th>
                        <th>Type</th>
                        <th>Priority</th>
                        <th>Associated Diagnoses</th>
                        <th>Order Schedule</th>
                    </tr>
                </thead>
                <tbody>
                    <tr ID="procedure10">
                        <td>
                            <content ID="procedure10name">
                                Drugs Of Abuse Comprehensive Screen, Ur
                            </content>
                        </td>
                        <td>Lab</td>
                        <td>Routine</td>
                        <td/>
                        <td ID="procedure10schedule">
                            One Time for 1 Occurrences starting 12/23/22 until 12/23/22
                        </td>
                    </tr>
                </tbody>
            </table>
    """
    # Take the namespace from the section itself rather than the global `ecr`,
    # so the function works for sections from any parsed document.
    ns = {'default': section.nsmap[None]}
    table_data = []
    tables = section.findall('.//default:table', ns)
    
    for table in tables:
        header = []
        header_row = table.find('.//default:thead/default:tr', ns)
        if header_row is not None:
            for cell in header_row.findall('.//default:th', ns):
                header.append(cell.text)
            table_data.append(header)
        for row in table.findall('.//default:tbody/default:tr', ns):
            row_data = []
            for cell in row.findall('.//default:td', ns):
                row_data.append(extract_cell_data(cell))
            table_data.append(row_data)
    return table_data


def compare_section_content(doc1, doc2, section_name):
    ecr1 = eICR(doc1)
    ecr2 = eICR(doc2)
    sections1 = ecr1.get_sections()
    sections2 = ecr2.get_sections()
    section1 = None
    section2 = None
    for section in sections1:
        if section.find('.//default:title', ecr1.ns).text == section_name:
            section1 = parse_section_content(section)
            break
    for section in sections2:
        if section.find('.//default:title', ecr2.ns).text == section_name:
            section2 = parse_section_content(section)
            break
    return {"Doc 1": section1, "Doc 2": section2}

compare_section_content(ecrs_by_patient['Peter Pan Epictest'][0], ecrs_by_patient['Peter Pan Epictest'][1], "Plan of Treatment")

TODO: update the code above to handle multiple tables, as well as account for when there is a `<content>` tag in the data
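
One way the TODO might be approached is to return one entry per `<table>` instead of a single flat list of rows, and to gather all text inside a cell so that nested `<content>` tags are covered. The block below is a sketch under stated assumptions, not a drop-in replacement. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs on its own, it assumes the CDA default namespace is `urn:hl7-org:v3`, and the names `NS`, `cell_text`, `parse_tables`, and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the documents use the HL7 v3 default namespace.
NS = {"default": "urn:hl7-org:v3"}


def cell_text(cell):
    """Collect all text inside an element, including text nested in <content>."""
    return " ".join("".join(cell.itertext()).split())


def parse_tables(section):
    """Return one dict per <table>, keeping caption, header, and rows separate."""
    tables = []
    for table in section.findall(".//default:table", NS):
        caption = table.find("default:caption", NS)
        header_cells = table.findall("default:thead/default:tr/default:th", NS)
        tables.append({
            "caption": cell_text(caption) if caption is not None else None,
            "header": [cell_text(th) for th in header_cells],
            "rows": [
                [cell_text(td) for td in tr.findall("default:td", NS)]
                for tr in table.findall("default:tbody/default:tr", NS)
            ],
        })
    return tables


# Trimmed version of the example shown in the docstring above.
SAMPLE = """
<section xmlns="urn:hl7-org:v3">
  <title>Plan of Treatment</title>
  <text>
    <table>
      <caption>Pending Results</caption>
      <thead><tr><th>Name</th><th>Type</th></tr></thead>
      <tbody>
        <tr><td><content ID="procedure9name">
          Drugs Of Abuse Comprehensive Screen, Ur
        </content></td><td>Lab</td></tr>
      </tbody>
    </table>
    <table>
      <caption>Scheduled Orders</caption>
      <thead><tr><th>Name</th><th>Type</th></tr></thead>
      <tbody><tr><td>Drugs Of Abuse Comprehensive Screen, Ur</td><td>Lab</td></tr></tbody>
    </table>
  </text>
</section>
"""

for parsed in parse_tables(ET.fromstring(SAMPLE)):
    print(parsed["caption"], parsed["header"], parsed["rows"])
```

Keeping each table under its own caption means "Pending Results" and "Scheduled Orders" can be compared separately across documents.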

In [None]:
def print_section_content_by_doc(patient_name, section_names):
    if not isinstance(section_names, list):
        section_names = [section_names]
        
    section_content_by_doc = {}
    for doc in ecrs_by_patient[patient_name]:
        ecr = eICR(doc)
        send_date = ecr.get_send_date()
        # Build the label per document so each one prints its own identifiers
        # rather than those of the last document processed.
        label = (
            f"{doc} - {send_date} - "
            f"{ecr.get_encounter_id()}:{ecr.get_set_id()}:{ecr.get_ecr_version()}"
        )
        for section in ecr.get_sections():
            section_name = section.find('.//default:title', ecr.ns).text
            if section_name in section_names:
                section_content_by_doc[label] = parse_section_content(section)

    for label, content in section_content_by_doc.items():
        print(label)
        for row in content:
            print(row)
        print("\n")

In [None]:
print_section_content_by_doc('Peter Pan Epictest', 'Plan of Treatment')

In [None]:
print_section_content_by_doc(
    'Peter Pan Epictest', 
    ['Encounter Details', 'Encounters', 'Encounter', 'ENCOUNTERS']
)

In [None]:
encounter_statistics = {}

for dir in all_dirs:
    file_path = Path("./../data/LAC_DATA/" + dir + "/CDA_eICR.xml")
    ecr = eICR(file_path)
    encounter_id = ecr.get_encounter_id()
    set_id = ecr.get_set_id()
    version_number = ecr.get_ecr_version()

    if encounter_id in encounter_statistics:
        encounter_statistics[encounter_id]["file_paths"].append(file_path)
        encounter_statistics[encounter_id]["set_ids"].append(set_id)
        encounter_statistics[encounter_id]["version_numbers"].append(version_number)
    else:
        encounter_statistics[encounter_id] = {
            "file_paths": [file_path],
            "set_ids": [set_id],
            "version_numbers": [version_number]
        }

for encounter_id, data in encounter_statistics.items():
    print(f"{encounter_id}:")
    print(f"\tNumber of files: {len(data['file_paths'])}")
    print(f"\tNumber of unique set IDs: {len(set(data['set_ids']))}")
    print(f"\tNumber of unique version numbers: {len(set(data['version_numbers']))}")

In [None]:
section_names_by_assigning_authority_name = {}
for dir in all_dirs:
    file_path = Path("./../data/LAC_DATA/" + dir + "/CDA_eICR.xml")
    ecr = eICR(file_path)
    sections = ecr.get_sections()
    assigning_authority_name = ecr.root.find('.//default:setId', ecr.ns)
    
    if assigning_authority_name is None:
        continue
    
    assigning_authority_name = assigning_authority_name.attrib.get('assigningAuthorityName')
    for section in sections:
        section_name = section.find('.//default:title', ecr.ns).text
        
        if assigning_authority_name in section_names_by_assigning_authority_name:
            section_names_by_assigning_authority_name[assigning_authority_name].append(section_name)
        else:
            section_names_by_assigning_authority_name[assigning_authority_name] = [section_name]

for k,v in section_names_by_assigning_authority_name.items():
    print(f"{k}: {len(set(v))}")
    for section in set(v):
        print(f"\t{section}")

In [None]:
for dir in all_dirs:
    ecr_fpath = Path("./../data/LAC_DATA/" + dir + "/CDA_eICR.xml")
    new_ecr_fpath = str(ecr_fpath).replace(".xml", "_formatted.xml")
    if not os.path.exists(new_ecr_fpath):
        ecr = eICR(ecr_fpath)
        formatted_xml = etree.tostring(ecr.tree, pretty_print=True).decode('utf-8')
        with open(new_ecr_fpath, 'w') as f:
            f.write(formatted_xml)
    
    rr_fpath = Path("./../data/LAC_DATA/" + dir + "/CDA_RR.xml")
    new_rr_fpath = str(rr_fpath).replace(".xml", "_formatted.xml")
    if not os.path.exists(new_rr_fpath):
        rr = RR(rr_fpath)
        formatted_xml = etree.tostring(rr.tree, pretty_print=True).decode('utf-8')
        with open(new_rr_fpath, 'w') as f:
            f.write(formatted_xml)

In [None]:
def generate_diff_heatmap(patient_name):
    ecrs = [str(fpath).replace(".xml", "_formatted.xml") for fpath in ecrs_by_patient[patient_name]]
    n = len(ecrs)
    differences = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i+1, n):
            diff = main.diff_files(ecrs[i], ecrs[j])
            differences[i][j] = len(diff)
            differences[j][i] = len(diff)
    sns.heatmap(differences, cmap='coolwarm')
    plt.xticks(ticks=[i + 0.5 for i in range(n)], labels=range(1, n+1))
    plt.yticks(ticks=[i + 0.5 for i in range(n)], labels=range(1, n+1))
    plt.xlabel('eCR Number')
    plt.ylabel('eCR Number')
    plt.show()

#generate_diff_heatmap('Peter Pan Epictest')

### Analyze Notes

In [None]:
ecrs_by_patient.keys()

In [None]:
# extract all of the text from any section in the eICR that mentions "Notes" in the title
def extract_notes_text(patient_name):
    notes_text_count = {}
    for doc in ecrs_by_patient[patient_name]:
        ecr = eICR(doc)
        sections = ecr.get_sections()
        for section in sections:
            section_name = section.find('.//default:title', ecr.ns).text
            if "note" in section_name.lower():
                print(f"Found a notes section: {section_name}")
                notes_content = section.find('.//default:text/default:content', ecr.ns)
                # Skip sections with no <content> element or with no leading text in it
                if notes_content is None or notes_content.text is None:
                    continue
                text = notes_content.text.replace('\n', '').strip()
                # Count how often each distinct note text appears per section name
                counts = notes_text_count.setdefault(section_name, {})
                counts[text] = counts.get(text, 0) + 1
    return notes_text_count

In [None]:
for patient in ecrs_by_patient:
    notes_text = extract_notes_text(patient)
    if notes_text:
        print(f"{patient}:")
        for section_name in notes_text:
            print(f"\t{section_name}:")
            for note, count in notes_text[section_name].items():
                print(f"\t\t{note}: {count}")
        print("\n")

In [None]:
# for each notes section, extract the raw XML starting with the <section> tag and ending with the </section> tag
def extract_notes_xml(patient_name):
    notes_xml = []
    for doc in ecrs_by_patient[patient_name]:
        ecr = eICR(doc)
        sections = ecr.get_sections()
        for section in sections:
            section_name = section.find('.//default:title', ecr.ns).text
            if "note" in section_name.lower():
                xml = etree.tostring(section,  pretty_print=True).decode('utf-8')
                notes_xml.append(xml)
    return notes_xml

# print each notes section in formatted XML
def print_notes_xml(patient_name):
    notes_xml = extract_notes_xml(patient_name)
    for xml in notes_xml:
        print(xml)
        print("\n")

In [None]:
print_notes_xml('Adam Test')

### How many conditions are represented in the data?

1. How many repeated results are we getting with the same day/time across different eICRs?
2. What's the minimum definition of a "document"?
3. What's the minimum definition of a "patient record"?
4. According to the definitions in 2. and 3., how many duplicates do we have?
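
To make questions 2 and 4 concrete: once a candidate document key is chosen, duplicates can be counted by grouping on that key. The block below is a hypothetical illustration only. It assumes, without confirmation from the data, that `(set_id, version)` could serve as a document key. The records are made-up placeholders, not LAC values, and `count_duplicates` is not an existing helper in this notebook.

```python
from collections import Counter


def count_duplicates(records, key_fields):
    """Group records by the chosen key fields and count the surplus copies."""
    keys = Counter(tuple(record.get(field) for field in key_fields) for record in records)
    duplicates = sum(count - 1 for count in keys.values())
    share = duplicates / len(records) if records else 0.0
    return duplicates, share


# Made-up records; in the notebook these values would come from
# get_set_id(), get_ecr_version(), and get_encounter_id().
records = [
    {"set_id": "A", "version": "1", "encounter_id": "E1"},
    {"set_id": "A", "version": "1", "encounter_id": "E1"},
    {"set_id": "A", "version": "2", "encounter_id": "E1"},
    {"set_id": "B", "version": "1", "encounter_id": "E2"},
]

duplicates, share = count_duplicates(records, ("set_id", "version"))
print(f"{duplicates} duplicate(s), {share:.0%} of documents")
```

Running the same function with different `key_fields` shows how sensitive the duplicate rate is to the chosen definition of a "document".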

Things to consider adding to the Research Plan (not exact quotes or findings):

- "Based on the LAC data, we found that 20% of documents were duplicated. We hypothesize that this will generalize across other STLTs as well."
- "Hypothesis: The definition of a document is <insert definition>. We are looking for these fields: <insert field>"
