In [1]:
import json
from bs4 import BeautifulSoup

## Loading definitions

In [2]:
with open('keyword-database-export.json', encoding='utf-8') as file:
    whole_json = json.load(file)

In [3]:
definitions = [entry['definition'] for entry in whole_json.values() if 'definition' in entry]
word_definitions = [definition for definition in definitions if 'OfficeDocumentSettings' in definition]
standard_definitions = [definition for definition in definitions if 'OfficeDocumentSettings' not in definition]

print('all:', len(definitions))
print('word:', len(word_definitions))
print('standard:', len(standard_definitions))

all: 1505
word: 89
standard: 1416
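
The split above scans the definitions twice, once per list. A minimal single-pass sketch is shown below; `sample_export` and `split_definitions` are hypothetical names, and the sample data is made up to mimic the shape of the export (a dict of entries, some without a `definition` key).

```python
# Hypothetical miniature of the export: keys map to entries, some lack a definition.
sample_export = {
    "1": {"definition": "<p>Plain HTML definition</p>"},
    "2": {"definition": "<xml><o:OfficeDocumentSettings></o:OfficeDocumentSettings></xml><p>Pasted from Word</p>"},
    "3": {"title": "entry without a definition"},
}

WORD_MARKER = "OfficeDocumentSettings"  # same marker string as in the cell above

def split_definitions(export):
    """Return (word, standard) definition lists in a single pass."""
    word, standard = [], []
    for entry in export.values():
        if "definition" not in entry:
            continue  # entries without a definition are skipped, as above
        definition = entry["definition"]
        (word if WORD_MARKER in definition else standard).append(definition)
    return word, standard

word, standard = split_definitions(sample_export)
print("word:", len(word), "standard:", len(standard))
```

Because every definition lands in exactly one list, the two counts always add up to the total, which matches the numbers printed above (89 + 1416 = 1505).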


# Simple way
Just remove all tags to get plain text. This is robust, but all styling information is lost.

In [32]:
def clean_word_xml(xml_text):
  return BeautifulSoup(xml_text, "lxml").text.strip()

In [38]:
cleaned = clean_word_xml(word_definitions[2])
print(cleaned)

Acetylcholine is synthesized from choline and acetyl coenzyme A, catalyzed by the enzyme O-acetyl transferase. Once synthesized, the acetylcholine molecules are stored in synaptic vesicles in high concentration near the motor endplate.  SNAREs (soluble N-ethylmaleimide-sensitive fusion protein attachment protein receptors) are a multigene family of neuron-specific phosphoproteins and are the most abundant proteins on synaptic vesicles. SNARE proteins are able to interact in vitro with lipid and protein components of synaptic vesicles and with various cytoskeletal proteins, including actin. Synapsin I is a phosphorylated protein found in nerve terminals. Its role is to attach the synaptic vesicles to elements of the cytoskeleton. Synapsin I has a binding site for a calcium/calmodulin-dependent protein kinase II. Phosphorylation of synapsinI decreases its binding affinity for synapsin I to the synaptic vesicle and facilitates the release of acetylcholine.    Synaptotagmin, synaptophysins
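
The plain-text output above still contains runs of several spaces where markup was removed. One possible post-processing step, not part of the original notebook, is to collapse whitespace with a regular expression; `normalize_whitespace` is a hypothetical helper name.

```python
import re

def normalize_whitespace(text):
    """Collapse any run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Fragments taken from the output above, with their original spacing
sample = "near the motor endplate.  SNAREs are a multigene family.    Synaptotagmin, synaptophysins"
print(normalize_whitespace(sample))
```

Note that this also collapses newlines, so paragraph breaks are lost; if they matter, the pattern could be limited to spaces and tabs instead.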

# More advanced way
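Keep only span, div and p tags (every other tag is unwrapped, so its text survives), strip comments plus class and id attributes, and drop style declarations that start with mso-.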

In [39]:
from bs4 import Comment

VALID_TAGS = ['span', 'div', 'p']

def clean_word_xml(xml_text):
  soup = BeautifulSoup(xml_text, "lxml")
  _remove_comments(soup)
  _remove_invalid_tags(soup)
  _fix_remaining_tags(soup)
  return str(soup).strip()

def _remove_comments(soup):
  comments = soup.find_all(string=lambda text: isinstance(text, Comment))
  for c in comments:
    c.extract()

def _remove_invalid_tags(soup):
  tags = soup.select(f":not({','.join(VALID_TAGS)})")
  for tag in tags:
    tag.unwrap()

def _fix_remaining_tags(soup):
  tags = soup.find_all()
  for tag in tags:
    _remove_invalid_style_attrs(tag)
    _remove_class_and_id(tag)

def _remove_invalid_style_attrs(tag):
  if 'style' not in tag.attrs:
    return
  # Split the inline style into declarations and drop the Word-specific (mso*) ones
  attrs = tag['style'].split(';')
  filtered_attrs = [attr for attr in attrs if not attr.strip().startswith('mso')]
  tag['style'] = ';'.join(filtered_attrs)

def _remove_class_and_id(tag):
  # pop() with a default is a no-op when the attribute is absent
  tag.attrs.pop("class", None)
  tag.attrs.pop("id", None)
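
The style filtering step can be tried in isolation, without BeautifulSoup. The sketch below is a slightly stricter variant of `_remove_invalid_style_attrs`, not the notebook's own code: it compares the property name case-insensitively, skips empty declarations, and normalizes the separator. `filter_style` and `banned_prefix` are hypothetical names, and the example style string is made up.

```python
def filter_style(style, banned_prefix="mso-"):
    """Drop declarations whose property name starts with banned_prefix."""
    kept = []
    for declaration in style.split(";"):
        declaration = declaration.strip()
        if not declaration:
            continue  # skip empty pieces left by trailing or doubled semicolons
        prop = declaration.split(":", 1)[0].strip().lower()
        if not prop.startswith(banned_prefix):
            kept.append(declaration)
    return "; ".join(kept)

example = "margin-bottom: .0001pt; mso-add-space: auto; line-height: normal; MSO-fareast-language: EN-US"
print(filter_style(example))
# margin-bottom: .0001pt; line-height: normal
```

Splitting on `;` is a simplification: it would break on values that contain a semicolon themselves, such as quoted strings or `url(...)` data URIs, so a real CSS parser would be safer if such values are expected.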

In [52]:
definition = word_definitions[31]
cleaned = clean_word_xml(definition)
print(cleaned)

<p style="margin-bottom: .0001pt; line-height: normal;"><span style="font-size: 12.0pt;">Incidence</span><span style="font-size: 12.0pt;"> - Historically, TURP syndrome occurred in 2-15% of patients. Now incidence is lower at around 1.1% with newer surgical techniques.</span></p>
<p style="margin-bottom: .0001pt; line-height: normal;"><span style="font-size: 12.0pt;"> </span></p>
<p style="margin-bottom: .0001pt; line-height: normal;"><span style="font-size: 12.0pt;">Mechanism</span><span style="font-size: 12.0pt;"> – Continuous irrigation is readily absorbed by prostatic tissue due to large venous sinuses. Generally speaking, 10 to 30 cc of solution is absorbed per minute of resection time. The amount absorbed depends on duration of procedure, hydrostatic pressure and magnitude of surgical venous sinuses opening. If monopolar cautery used, LR and normal saline can’t be used and water was often used leading to hyponatremia and volume overload. This is more problematic in patients with 