This script extracts the case summary / case report sections from the NXML files, in order to determine diagnosis similarity based on generated queries.

In [1]:
import json


In [2]:
# Load the matched article pairs; each entry is a pair of NXML file paths.
with open('/cs/labs/tomhope/yuvalbus/pmc/pythonProject/largeListsGuy/matching_uids_xml_paths.json', 'r') as f:
    matching_uids_xml_paths = json.load(f)
matching_uids_xml_paths[0:1]

[['/cs/labs/tomhope/yuvalbus/pmc/pythonProject/data2/PMC8167975/8167975_1/12886_2021_Article_2004.nxml',
  '/cs/labs/tomhope/yuvalbus/pmc/pythonProject/data2/PMC5563556/5563556_1/TJO-47-243.nxml']]

In [3]:
from lxml import etree
from multiprocessing import Pool

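# XPath expression matching <sec> elements whose <title> contains any of the keywords below.
# translate() lower-cases ASCII letters, which makes the match case-insensitive.
# Note: the bare 'case' test already covers every 'case ...' phrase listed before it.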
xpath_expr = (
    "//sec["
    "  contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'case presentation') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'case report') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'case description') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'case summary') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'patient presentation') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'patient report') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'patient description') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'patient summary') "
    "or contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'case') "
    "or (normalize-space(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')) = 'patient') "
    "]"
)


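def process_file_list(file_list):
    """
    Given a list of file paths, parse each file, run the XPath, and gather
    the paragraph text of the matching <sec> elements.
    Returns a dict {filepath: text}; files with no matching section are omitted.
    """
    single_pair_case_summary = {}
    for file in file_list:
        tree = etree.parse(file)
        sections = tree.xpath(xpath_expr)
        for sec in sections:
            # Collect every text node under the section's <p> elements.
            paragraphs = sec.xpath(".//p//text()")
            text = " ".join(paragraphs)
            # Note: each matching section overwrites the previous one,
            # so only the last match per file is kept.
            single_pair_case_summary[file] = text
    return single_pair_case_summary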

def multiprocess_case_summaries():
    with Pool() as pool:
        # Map each sub-list to a worker process
        case_summary_pairs_list = pool.map(process_file_list, matching_uids_xml_paths)

    return case_summary_pairs_list


case_summary_pairs_list = multiprocess_case_summaries()
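
The title matching above can be illustrated on a tiny in-memory document. This is only a sketch: it swaps lxml's XPath for plain Python string checks on the standard-library `xml.etree.ElementTree` (which does not support `translate()`), and the names `SUBSTRINGS`, `title_matches`, `matching_section_texts` and `SAMPLE` are hypothetical, not part of the notebook. It also uses `str.lower()`, which lower-cases non-ASCII letters as well, unlike the ASCII-only `translate()` call.

```python
import xml.etree.ElementTree as ET

# Substrings checked against the lower-cased title (assumed to mirror the XPath).
# "case" already covers "case report", "case presentation", and so on.
SUBSTRINGS = ("case", "patient presentation", "patient report",
              "patient description", "patient summary")


def title_matches(title):
    """Return True if a section title would be selected by the keyword rules."""
    lowered = title.lower()
    # The last clause mimics normalize-space(...) = 'patient'.
    return any(s in lowered for s in SUBSTRINGS) or " ".join(lowered.split()) == "patient"


def matching_section_texts(xml_string):
    """Return the paragraph text of every matching <sec>, in document order."""
    root = ET.fromstring(xml_string)
    texts = []
    for sec in root.iter("sec"):
        title = sec.find("title")  # direct child only, like `title` in the XPath
        if title is None:
            continue
        if title_matches("".join(title.itertext())):
            paragraphs = [" ".join("".join(p.itertext()).split()) for p in sec.iter("p")]
            texts.append(" ".join(paragraphs))
    return texts


SAMPLE = """<article><body>
  <sec><title>Background</title><p>Intro text.</p></sec>
  <sec><title>Case Report</title><p>A 40-year-old <italic>woman</italic> presented.</p></sec>
  <sec><title>Patient</title><p>Second match.</p></sec>
  <sec><title>Patients and methods</title><p>Not matched.</p></sec>
</body></article>"""

texts = matching_section_texts(SAMPLE)
print(texts)      # both matching sections
print(texts[-1])  # the one a per-file dict assignment inside the loop would keep
```

Two sections match in this sample. Because `process_file_list` assigns `single_pair_case_summary[file] = text` once per matching section, a file like this would end up with only the last section's text.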



In [4]:
# Keep only the extracted texts of each pair, dropping the file paths.
case_summary_pairs_list = [list(pair.values()) for pair in case_summary_pairs_list]

In [5]:
def calc_indices_summaries_list(case_summary_pairs_list):
    """
    Returns two lists:
      - Indices of pairs for which both case summaries were extracted.
      - Indices of pairs for which fewer than two case summaries were extracted.
    """
    valid_indices = [idx for idx, pair in enumerate(case_summary_pairs_list) if len(pair) == 2]
    invalid_indices = [idx for idx, pair in enumerate(case_summary_pairs_list) if len(pair) != 2]

    return valid_indices, invalid_indices
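
A toy run of the same partitioning logic may make the return value easier to read. The function name `split_pair_indices` and the list `toy_pairs` are hypothetical stand-ins used only for this sketch. The two index lists always partition the input, so their lengths sum to the number of pairs.

```python
def split_pair_indices(pairs):
    """Split indices into complete pairs (exactly 2 texts) and incomplete ones."""
    valid = [idx for idx, pair in enumerate(pairs) if len(pair) == 2]
    invalid = [idx for idx, pair in enumerate(pairs) if len(pair) != 2]
    return valid, invalid


# Each inner list holds the texts extracted for one pair of articles.
toy_pairs = [
    ["summary A", "summary B"],  # both summaries found
    ["summary C"],               # only one summary found
    [],                          # no summary found
    ["summary D", "summary E"],  # both summaries found
]

valid, invalid = split_pair_indices(toy_pairs)
print(valid, invalid)

# Every index lands in exactly one of the two lists.
assert len(valid) + len(invalid) == len(toy_pairs)
```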



In [6]:
len(calc_indices_summaries_list(case_summary_pairs_list)[1])

83

In [7]:
len(case_summary_pairs_list)

31692

In [9]:
v, i = calc_indices_summaries_list(case_summary_pairs_list)
len(v)

31609

In [None]:
# with open('/cs/labs/tomhope/yuvalbus/pmc/pythonProject/largeListsGuy/case_summary_pairs_list.json', 'w') as f:
#     json.dump(case_summary_pairs_list, f)

with open('/cs/labs/tomhope/yuvalbus/pmc/pythonProject/largeListsGuy/case_summary_pairs_list.json', 'r') as f:
    case_summary_pairs_list = json.load(f)

In [11]:
list(case_summary_pairs_list[0].values())[0]

'A 40-year-old woman who first presented with complaints of decreased vision and metamorphopsia in her right eye was diagnosed with CO 9\u2009years ago. At that time her best corrected visual acuity (BCVA) was 12/20. Both eyes had normal anterior segments. Right fundus examination showed a geographic-shaped, yellowish-white choroidal lesion surrounding the optic disc in the right eye (Fig.\xa0 1 ). B-scan ultrasonography of right eye revealed a typical dense echogenic plaque which causing acoustic shadowing behind (Fig.\xa0 2 ). FFA and ICGA had no evidence of CNV except early hyperfluorescent choroidal filling pattern with late diffuse staining. Computerized tomography (CT) showed a hyperdense choroidal plaque with the same density as bone (Fig.\xa0 3 ). Optical coherence tomography (OCT) demonstrated serous retinal detachments at the initial examination (Fig.\xa0 4 ).\n Fig. 1 Fundus photograph showed a yellowish white, peripapillary and sharply demarcated choroidal lesion involving 

In [95]:
len(case_summary_pairs_list)

31692

In [19]:
case_summary_pairs_list[5]

{'/cs/labs/tomhope/yuvalbus/pmc/pythonProject/data2/PMC7732735/7732735_1/cureus-0012-00000011434.nxml': 'A 40-year-old African American female with a past medical history of hypertension and lupus profundus was referred to the gastroenterology clinic by her primary physician for endoscopic evaluation of a one-year history of intermittent left lower quadrant pain, constipation, and hematochezia. The patient described noticing a bulge located in her left lower quadrant when symptomatic that would subsequently disappear after having a bowel movement. She often needed to massage the area and change position to facilitate a bowel movement. Upon further questioning, the patient denied use of nonsteroidal anti-inflammatory drugs (NSAIDs), opiates or other medications known to cause constipation. She denied any personal or family history of colon cancer. Her physical exam, including a digital rectal exam, was unremarkable.\xa0Laboratory tests including hemoglobin and hematocrit were also withi

In [74]:
case_summary_pairs_list[0:10]

[['The authors present a 40-year-old woman diagnosed with choroidal osteoma of the right eye. Her best corrected visual acuity was 12/20 but decreased to 5/20 due to secondary choroidal neovascularization after 8\u2009years follow up. Fundus examination revealed an enlarged choroidal osteoma in most margins at posterior pole with schistose hemorrhage beside macula. Optical coherence tomography angiography revealed unique features in the vascular changes of choroidal neovascularization in choroidal osteoma in the outer retinal layer and choroid capillary layers, and subretinal neovascularization. Indocyanine green fluorescence angiography showed there was hypo-fluorescence at the peripapillary with faint hyper-fluorescence at the macular, corresponding to the location on the fundus photograph. The patient received 3 injections of intravitreal ranibizumab. After 1\u2009year follow up, her visual acuity of the right eye was 18/20 and the CNV had regressed.',
  'A 40-year-old woman who fir