# Working with xml / html

You may be familiar with markup languages. They allow us to add tags to a document that contain information about how the document is to be displayed, formatted, or processed. 

HTML - HyperText Markup Language - is the standard markup language for webpages. The html for a basic webpage may look something like this:

```
<!DOCTYPE HTML>
<html>
	<head>
		<title>[Website Title]</title>
        <link rel="stylesheet" href="link/to/stylesheet.css" />
    </head>
    <body>
        <div type = "section">
            <h1>[Website Title]</h1>
            <p>[A paragraph of content.........]</p>
        </div>
    </body>
</html>
```

In the example above, all text surrounded by <brackets> is a *tag*, and additional information inside a tag is an *attribute* (see `type = "section"` in the div tag above).
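To make the tag / attribute distinction concrete, here is a minimal sketch that lists every opening tag and its attributes in a small snippet. It uses Python's built-in `html.parser` module rather than any library introduced later, and the class name `TagLister` is just a name chosen for this illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


snippet = '<div type="section"><h1>[Website Title]</h1><p>[A paragraph]</p></div>'

lister = TagLister()
lister.feed(snippet)

for tag, attrs in lister.found:
    print(tag, attrs)
```

The div tag is reported with its `type` attribute, while `h1` and `p` come back with empty attribute dictionaries.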

Since the goal of html is to display data, HTML is static with pre-defined tags.

Meanwhile, the other popular markup language, XML - Extensible Markup Language - is designed to store and transfer data and thus allows the user to create their own tags. 

XML may be used to transfer data from one database to another, or simply to store data.

**Most importantly, for text analysis, xml can be used to encode text documents, tagging both structural elements (chapters, pages, sections, paragraphs, footnotes, etc.) and content (place and person names, dates, etc.). By encoding texts with xml, we can treat them like searchable databases. For example, if you have a corpus of medical journal articles, you may want to quickly search through the abstract of each article and, if it meets your search criteria, return the title, author, and date. Or you may choose to extract only the citations from each article. Having these articles encoded with xml makes this easy.**
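As a rough sketch of the "searchable database" idea, the example below searches the abstracts of a tiny, invented corpus and returns the title, author, and date of each match. Everything here is an assumption made for illustration: the tag names (`article`, `abstract`, and so on), the two articles, and the function name `search_abstracts`. It also uses the standard library's `xml.etree.ElementTree` so that it runs without installing anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article corpus; the tag names are invented for this sketch.
corpus = """<corpus>
    <article>
        <title>Sleep and Memory</title>
        <author>Lee</author>
        <date when="2019-03-01">March 1, 2019</date>
        <abstract>We study how sleep affects memory consolidation.</abstract>
    </article>
    <article>
        <title>Diet and Heart Health</title>
        <author>Okafor</author>
        <date when="2021-07-15">July 15, 2021</date>
        <abstract>A cohort study of diet and cardiovascular outcomes.</abstract>
    </article>
</corpus>"""

root = ET.fromstring(corpus)


def search_abstracts(root, keyword):
    """Return (title, author, date) for each article whose abstract mentions keyword."""
    hits = []
    for article in root.iter("article"):
        abstract = article.findtext("abstract", default="")
        if keyword.lower() in abstract.lower():
            date = article.find("date")
            when = date.get("when") if date is not None else ""
            hits.append((
                article.findtext("title", default=""),
                article.findtext("author", default=""),
                when,
            ))
    return hits


print(search_abstracts(root, "memory"))
```

Because the abstract, title, author, and date each live in their own tag, the search never has to guess where one field ends and the next begins.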



## xml as data storage

Sometimes, xml documents are created from scratch to record new information. Take, for example, the document below:

In [54]:
### run this code to read in this document
### note: the triple quotes (""") signify that the string continues across multiple lines until closed with a triple quote.
patient_records = """<?xml version='1.0' encoding='utf-8'?>
<xml>
    <head>
        <title>Patient Records for Dr. Who</title>
        <hospital>Mercy Hospital</hospital>
        <doctor empID = "2674AX"><surname>Who</surname></doctor>
    </head>
    <body>
        <patient>
            <idnum>00866</idnum>
            <name>
                <prefix>Ms.</prefix>
                <suffix/>
                <surname>Washington</surname>
                <firstname>María</firstname>
            </name>
            <gender genderCode = "1">female</gender>
            <birthdate when="1964-05-22">May 22, 1964</birthdate>
            <birthplace>Little Rock, AR</birthplace>
        </patient>

        <patient>
            <idnum>00982</idnum>
            <name>
                <prefix></prefix>
                <suffix/>
                <surname>Brown</surname>
                <firstname>Nomi</firstname>
            </name>
            <gender genderCode = "2">non-binary / other</gender>
            <birthdate when="1987-04-13">April 13, 1987</birthdate>
            <birthplace>Boston, MA</birthplace>
        </patient>

        <patient>
            <idnum>00376</idnum>
            <name>
                <prefix>Dr.</prefix>
                <suffix/>
                <surname>Sánchez-Barillas</surname>
                <firstname>Amos</firstname>
            </name>
            <gender genderCode = "0">male</gender>
            <birthdate when="1999-12-31">Dec 31, 1999</birthdate>
            <birthplace>Nashua, NH</birthplace>
        </patient>
    </body>
</xml>
"""

## Part I: Parsing xml from one xml document with BeautifulSoup

We can parse the above xml document using **Beautiful Soup**, a Python library designed to extract data from HTML and XML files. We will use it to search for, export, and analyze data from some xml files. [Click here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#xml) to view the documentation for BeautifulSoup.

1. Let's import the BeautifulSoup class from the bs4 package:

In [2]:
from bs4 import BeautifulSoup

2. To parse our xml document of patient records above, we need to read it into BeautifulSoup and specify that we are parsing xml:

In [55]:
soup = BeautifulSoup(patient_records, "xml")
print(soup)

<?xml version="1.0" encoding="utf-8"?>
<xml>
<head>
<title>Patient Records for Dr. Who</title>
<hospital>Mercy Hospital</hospital>
<doctor empID="2674AX"><surname>Who</surname></doctor>
</head>
<body>
<patient>
<idnum>00866</idnum>
<name>
<prefix>Ms.</prefix>
<suffix/>
<surname>Washington</surname>
<firstname>María</firstname>
</name>
<gender genderCode="1">female</gender>
<birthdate when="1964-05-22">May 22, 1964</birthdate>
<birthplace>Little Rock, AR</birthplace>
</patient>
<patient>
<idnum>00982</idnum>
<name>
<prefix/>
<suffix/>
<surname>Brown</surname>
<firstname>Nomi</firstname>
</name>
<gender genderCode="2">non-binary / other</gender>
<birthdate when="1987-04-13">April 13, 1987</birthdate>
<birthplace>Boston, MA</birthplace>
</patient>
<patient>
<idnum>00376</idnum>
<name>
<prefix>Dr.</prefix>
<suffix/>
<surname>Sánchez-Barillas</surname>
<firstname>Amos</firstname>
</name>
<gender genderCode="0">male</gender>
<birthdate when="1999-12-31">Dec 31, 1999</birthdate>
<birthplace>N

3. XML is a nested language, meaning each element or tag must be contained within a hierarchy of elements or tags. To better represent the nested nature of our xml document we can "pretty print" it.

In [12]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<xml>
 <head>
  <title>
   Patient Records for Dr. Who
  </title>
  <hospital>
   Mercy Hospital
  </hospital>
  <doctor empID="2674AX">
   <surname>
    Who
   </surname>
  </doctor>
 </head>
 <body>
  <patient>
   <idnum>
    00866
   </idnum>
   <name>
    <prefix>
     Ms.
    </prefix>
    <suffix/>
    <surname>
     Washington
    </surname>
    <firstname>
     María
    </firstname>
   </name>
   <gender genderCode="1">
    female
   </gender>
   <birthdate when="1964-05-22">
    May 22, 1964
   </birthdate>
   <birthplace>
    Little Rock, AR
   </birthplace>
  </patient>
  <patient>
   <idnum>
    00982
   </idnum>
   <name>
    <prefix/>
    <suffix/>
    <surname>
     Brown
    </surname>
    <firstname>
     Nomi
    </firstname>
   </name>
   <gender genderCode="2">
    non-binary / other
   </gender>
   <birthdate when="1987-04-13">
    April 13, 1987
   </birthdate>
   <birthplace>
    Boston, MA
   </birthplace>
  </patient>
  <patient>
   <idnum>
    0037

4. We can quickly parse and pull out information from this text (now saved as "soup") using the following commands. Below, we will extract entire elements (tag + text) as well as just the text contained within a tag.

Please see the [BeautifulSoup webpage](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#xml) for more information and examples.

In [56]:
print(soup.head)
print(soup.head.text)
print(soup.head.title)
print(soup.head.title.text)

<head>
<title>Patient Records for Dr. Who</title>
<hospital>Mercy Hospital</hospital>
<doctor empID="2674AX"><surname>Who</surname></doctor>
</head>

Patient Records for Dr. Who
Mercy Hospital
Who

<title>Patient Records for Dr. Who</title>
Patient Records for Dr. Who


5. Let's see what happens when we ask for a tag that doesn't exist:

In [57]:
print(soup.author.text)

AttributeError: 'NoneType' object has no attribute 'text'

5b. To avoid such errors interrupting our code, we can write an if statement such as:

In [17]:
author = soup.find("author")
if author: 
    print(author.text)
else:
    print("No author tag")

No author tag


5c. We can also write a **try / except** statement to catch the error:

In [64]:
try:
    author = soup.find("author")
    print(author.text)
except AttributeError:
    print("there is no author listed")

there is no author listed
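If we need this kind of check many times, it can be wrapped in a small helper function. The sketch below is one possible way to do that, not part of BeautifulSoup itself: `text_of` is a hypothetical name, and the example uses the standard library's `xml.etree.ElementTree` so it runs on its own. With BeautifulSoup, the same idea would be `tag = soup.find(name)` followed by `tag.text if tag else default`.

```python
import xml.etree.ElementTree as ET

# A cut-down patient record, repeated here so the sketch is self-contained.
record = ET.fromstring("""<patient>
    <idnum>00982</idnum>
    <name>
        <prefix></prefix>
        <surname>Brown</surname>
        <firstname>Nomi</firstname>
    </name>
</patient>""")


def text_of(parent, tag_name, default=""):
    """Return the text of the first descendant named tag_name, or default."""
    node = parent.find(f".//{tag_name}")
    # A missing tag gives None; an empty tag has text None in ElementTree.
    if node is None or node.text is None:
        return default
    return node.text


print(text_of(record, "surname"))            # Brown
print(text_of(record, "author", "unknown"))  # unknown
print(repr(text_of(record, "prefix")))       # ''
```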


6. If there are multiple instances of a tag, we can use **find_all** to retrieve all instances (**find** just retrieves the first instance).

In [22]:
soup.find_all("surname")

[<surname>Who</surname>,
 <surname>Washington</surname>,
 <surname>Brown</surname>,
 <surname>Sánchez-Barillas</surname>]

6b. In the case above, we may just want the surnames of patients and not the doctor. Since patient records are stored separately in the body tag, we can simply search within the body:

In [66]:
body = soup.body
body.find_all("surname")

[<surname>Washington</surname>,
 <surname>Brown</surname>,
 <surname>Sánchez-Barillas</surname>]

7. We can then iterate through each patient and retrieve information about each.

In [67]:
patients = body.find_all("patient")
print(len(patients))

3


In [29]:
patient_list = []

for patient in patients:
    pat_id = patient.idnum.text
    pat_first = patient.firstname.text
    pat_last = patient.surname.text
    pat_prefix = patient.prefix.text
    pat_suffix = patient.suffix.text
    pat_fullname = pat_prefix + pat_first + pat_last + pat_suffix
    pat_birthyear = patient.birthdate['when'].text[:4]
    pat_birthstate = patient.birthplace.text[-2:]
    patient_list.append([pat_id, pat_fullname, pat_birthyear, pat_birthstate])

AttributeError: 'NoneType' object has no attribute 'text'

8. You may notice an **AttributeError** above. This error occurs when we ask for an attribute (such as `.text`) on an object that does not have one, most often because the tag we were looking for does not exist and BeautifulSoup returned `None`. One way to avoid this is to wrap each instruction in a **try / except** statement. For example:


In [68]:
patient_list = []

for patient in patients:
    try:
        pat_id = patient.idnum.text
    except AttributeError:
        pat_id = ""
    try:
        pat_first = patient.firstname.text
        pat_last = patient.surname.text
    except AttributeError:
        print("missing required info: names")
        continue
    try:
        pat_prefix = patient.prefix.text
    except AttributeError:
        pat_prefix = ""
    try:
        pat_suffix = patient.suffix.text
    except AttributeError:
        pat_suffix = ""
    # join only the non-empty name parts so we don't get stray spaces
    name_parts = [pat_prefix, pat_first, pat_last, pat_suffix]
    pat_fullname = ' '.join(part for part in name_parts if part)
    try:
        pat_birthyear = patient.birthdate['when'][:4]
    except (TypeError, KeyError):
        # TypeError: no birthdate tag (None['when']); KeyError: no 'when' attribute
        pat_birthyear = ""
    try:
        pat_birthstate = patient.birthplace.text[-2:]
    except AttributeError:
        pat_birthstate = ""
    patient_list.append([pat_id, pat_fullname, pat_birthyear, pat_birthstate])

In [69]:
print(patient_list)

[['00866', 'Ms. María Washington', '1964', 'AR'], ['00982', 'Nomi Brown', '1987', 'MA'], ['00376', 'Dr. Amos Sánchez-Barillas', '1999', 'NH']]


In [73]:
import pandas as pd
patient_df = pd.DataFrame(patient_list, columns = ["idnum", "full_name", "birth_year", "birth_state"])
patient_df.head()

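|   | idnum | full_name | birth_year | birth_state |
|---|-------|-----------|------------|-------------|
| 0 | 00866 | Ms. María Washington | 1964 | AR |
| 1 | 00982 | Nomi Brown | 1987 | MA |
| 2 | 00376 | Dr. Amos Sánchez-Barillas | 1999 | NH |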
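Once the records are in rows like these, we may also want to export them. Here is a minimal sketch using Python's built-in `csv` module. The rows are typed in by hand in the shape of `patient_list` so the example runs on its own, and an in-memory buffer stands in for a real file (the file name in the comment is only a placeholder).

```python
import csv
import io

# Rows in the same shape as patient_list, repeated so the sketch is self-contained.
rows = [
    ["00866", "Ms. María Washington", "1964", "AR"],
    ["00982", "Nomi Brown", "1987", "MA"],
    ["00376", "Dr. Amos Sánchez-Barillas", "1999", "NH"],
]
columns = ["idnum", "full_name", "birth_year", "birth_state"]

# io.StringIO stands in for open("patients.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# The csv module reads every field back as a string, so leading zeros survive.
reader = csv.DictReader(io.StringIO(csv_text))
ids = [row["idnum"] for row in reader]
print(ids)
```

Note that ID numbers such as `00866` only keep their leading zeros while they are treated as text. Tools that guess column types when reading a csv file (pandas' `read_csv` does this unless told otherwise, for example with `dtype=str`) will turn them into the numbers 866, 982, and 376.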


## Exercise for Part I

9. Create a dataframe for our patients just like we did above. But this time, include columns for each patient's birth month and initials.

## Part II: Parsing xml across a corpus of texts

10. Let's open a small corpus of xml documents:

In [76]:
from pathlib import Path

In [77]:
#Path("~/shared/RR-workshop-data/lts_xml").expanduser().exists()
xmldir = Path("~/shared/RR-workshop-data/lts_xml").expanduser()

print(xmldir)


C:\Users\F0040RP\shared\RR-workshop-data\lts_xml


11. Using the directory above, we are going to import xml documents created from Holocaust survivor testimony stored on the website [Let Them Speak](https://dhlab.yale.edu/projects/let-them-speak/), which was drawn from testimony recorded by the United States Holocaust Memorial Museum and the Shoah Foundation.

**The subject matter in this dataset is heavy stuff. But since I worked with a Holocaust Studies group recently, it is what I have readily available in xml (my other option was my corpus of documents from colonial Peru, also heavy subject matter and in Spanish). In any case, we will not be reading these texts, but instead just pulling out some information from them.**

In [80]:
pathlist = sorted(xmldir.glob("*.xml")) 
print([path.name for path in pathlist])

['RG-50.030-0015.xml', 'RG-50.030-0060.xml', 'RG-50.030-0072.xml', 'RG-50.030-0082.xml', 'RG-50.030-0083.xml', 'RG-50.030-0145.xml', 'RG-50.030-0198.xml', 'RG-50.030-0210.xml', 'RG-50.030-0234.xml', 'RG-50.030-0315.xml', 'RG-50.030-0316.xml', 'RG-50.030-0546.xml', 'RG-50.031-0075.xml', 'RG-50.106-0118.xml', 'RG-50.106-0139.xml', 'RG-50.165-0016.xml', 'RG-50.165-0123.xml', 'RG-50.233-0036.xml', 'RG-50.322-0005.xml', 'RG-50.462-0081.xml', 'RG-50.462-0102.xml', 'RG-50.470-0008.xml', 'RG-50.471-0013.xml', 'RG-50.471-0013b.xml', 'RG-50.549.02-0054.xml', 'RG-50.562-0005.xml']
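For readers who do not have the workshop data at hand, the same `glob` pattern can be tried on a throwaway folder. This is a small sketch with made-up file names created in a temporary directory; it shows that `glob("*.xml")` skips other file types and that `.name` and `.stem` give the file name with and without its extension.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    # create three small placeholder files, only two of which are xml
    for name in ["b.xml", "a.xml", "notes.txt"]:
        (demo_dir / name).write_text("<xml/>", encoding="utf-8")

    # glob returns matches in no guaranteed order, so we sort them
    xml_paths = sorted(demo_dir.glob("*.xml"))
    names = [p.name for p in xml_paths]
    stems = [p.stem for p in xml_paths]

print(names)  # ['a.xml', 'b.xml']
print(stems)  # ['a', 'b']
```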


12. Let's just see what one of these texts looks like, by examining what tags are found in it.


In [90]:
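# read the last file in our list (RG-50.562-0005.xml) and parse it
with open(pathlist[-1], encoding = 'utf-8') as f:
    example_doc = f.read()
ex_soup = BeautifulSoup(example_doc, "xml")

# collect the name of every tag in the document
children_list = []
for child in ex_soup.descendants:
    if child.name:
        children_list.append(child.name)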

print(set(children_list))

{'xml', 'front', 'question', 'answer', 'p', 'body', 'text', 'back'}
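A set tells us which tags exist, but not how often each one is used. As a hedged sketch of how we might count them, the example below tallies tag names with `collections.Counter`. The miniature document is invented: it reuses the tag names printed above, but its nesting and contents are assumptions, and it is parsed with the standard library's `xml.etree.ElementTree` so the example runs without the workshop files.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented miniature document; only the tag names come from the output above.
doc = """<xml>
  <text>
    <front><p>front matter</p></front>
    <body>
      <question>Q1</question>
      <answer>A1</answer>
      <question>Q2</question>
      <answer>A2</answer>
    </body>
    <back><p/></back>
  </text>
</xml>"""

root = ET.fromstring(doc)

# root.iter() walks the root element and every element nested inside it
tag_counts = Counter(element.tag for element in root.iter())

for tag, count in tag_counts.most_common():
    print(tag, count)
```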


12b. We can also quickly print out the front tag info from this text:

In [95]:
print(ex_soup.front)

<front>
<p>Cite as: "Oral history interview with Mark Moskovitz, USHMM" in: Gabor M. Toth, In Search of the Drowned: Testimonies and Testimonial Fragments of the Holocaust (Yale Fortunoff Archive, 2021), lts.fortunoff.library.yale.edu.</p>
<p>Shelfmark: USHMM RG-50.562*0005</p>
<p>Provenance: The United States Holocaust Memorial Museum Oral History Branch conducted the interview with Mark Moskovitz on October 4, 2004. The interview was conducted as part of the United States Holocaust Memorial Museum Oral History Project with David Boder Interviewees as a follow up to Boder's 1946 interviews. The interview was received by the United States Holocaust Memorial Museum Archives Branch in October 2004.</p>
<p>Interview Summary</p>
<p>Mark Moskovitz discusses his experiences during the Holocaust and his life since 1946 when he was interviewed as a displaced person by American psychology professor, David P. Boder</p>
<p>Oral history interview with Mark Moskovitz</p>
<p/>
<p>United States Holoc

13. As you may notice, this text includes "question" and "answer" tags, reflecting that the testimony was collected through an interview.

I wondered, when first perusing these testimonies, what I could learn from just reviewing the questions asked. In particular, I wanted to know: did the interviewers have a set of questions planned ahead of time, or did they improvise follow-up questions based on what the interviewees told them? (or was it somewhere in between?)

To get a quick overview of these texts without getting into the serious nature of the content, let's create a list and fill it with all questions asked across these 26 testimonies (setting answers aside for now).

In [81]:
Qlist = []

for path in pathlist:
    with open(path, encoding = 'utf-8') as f:
        xmldoc = f.read()
    soup = BeautifulSoup(xmldoc, "xml")
    body = soup.body
    questions = body.find_all("question")
    if len(questions) > 0:
        print(f"The document '{path.stem}' has {len(questions)} questions.")
    else: 
        print(f"The document '{path.stem}' has NO questions.")
    Qlist.extend([question.text for question in questions])


The document 'RG-50.030-0015' has 25 questions.
The document 'RG-50.030-0060' has 13 questions.
The document 'RG-50.030-0072' has 25 questions.
The document 'RG-50.030-0082' has 176 questions.
The document 'RG-50.030-0083' has 24 questions.
The document 'RG-50.030-0145' has 17 questions.
The document 'RG-50.030-0198' has 10 questions.
The document 'RG-50.030-0210' has 40 questions.
The document 'RG-50.030-0234' has 132 questions.
The document 'RG-50.030-0315' has NO questions.
The document 'RG-50.030-0316' has NO questions.
The document 'RG-50.030-0546' has 162 questions.
The document 'RG-50.031-0075' has NO questions.
The document 'RG-50.106-0118' has 185 questions.
The document 'RG-50.106-0139' has 207 questions.
The document 'RG-50.165-0016' has NO questions.
The document 'RG-50.165-0123' has NO questions.
The document 'RG-50.233-0036' has 275 questions.
The document 'RG-50.322-0005' has 253 questions.
The document 'RG-50.462-0081' has NO questions.
The document 'RG-50.462-0102' has

14. Now, let's peruse some of the questions found in these texts.

In [96]:
print(len(Qlist))
print(Qlist[:20])
print("...")
print(Qlist[-20:])

2603
['Please tell us your name, where you were born, and when you were born?', 'Tell us about your family and your childhood before the war.', 'At the border?', 'Tell us what happened to you and your mother after your brother was taken.', 'Who...who came to your mother?', 'What was life like daily in the convent?', 'And what did you do during the day?', 'How did you get along with the other children?', 'Did you see your mother often? How did you get from the convent to the Protestant community.', 'Tonsils?', 'Had you had no contact with your father at this point?', 'What happened after the liberation?', 'You said your husband was in high school with you.', "No. That's what I wondered.", 'Okay.', 'Yes. Tell me about the...the early years after the war.', 'Have you...uh...made contact with your family.', 'Do you have family here?', 'All right. Thank you. Was it so hard as you thought?', 'And that was the first you knew that?']
...
['Do you remember his last name?', 'Related?', 'What’s h

In [104]:
# we can also draw 20 questions at random from this list
# note: random.choices samples with replacement, so the same question can be drawn more than once
import random
random.choices(Qlist, k = 20)

['After the war, you mean?',
 'How long did you stay there, do you remember? Not very long?',
 '-- that you came across to France from Italy, you walked. But before you tell me the rest of the story, I would like to go back and ask you a few questions about your --',
 'Where did you come to?',
 "Abe, can you go back for just a minute because it wasn't clear. You were in Flossenberg, in the subcamp of Flossenberg. When were you taken to Therenienstadt.",
 'You saw that?',
 'OK! When you were in the University, there were so many leading philosophers who were anti-Semitic in terms of their writings.',
 '-- you think, when you went into the ghetto --',
 'The tears of the mothers?',
 'Right.',
 'Right.',
 'Tell me more about it, tell me more about that.',
 'Som-Somebody just walked in, let’s take a break. [tape break] Okay, we are back.',
 'Did you want -- had -- did you want children?',
 'Wait, take your glasses off and you have to sit back. Now tell me about that picture.',
 'Was there p
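Since `random.choices` draws with replacement, a sample can contain the same list entry twice. If we would rather see distinct questions, one option is to remove duplicates first and then use `random.sample`, which draws without replacement. This is a small sketch on a hand-typed list of questions taken from the output above. The seed is fixed only to make the example repeatable.

```python
import random

# A few questions copied from the output above, including a repeated one.
questions = [
    "Right.",
    "Right.",
    "You saw that?",
    "Where did you come to?",
    "The tears of the mothers?",
]

random.seed(0)  # fixed seed only so the sketch is repeatable

# remove duplicate questions, then draw 3 without replacement
unique_questions = sorted(set(questions))
sample = random.sample(unique_questions, k=3)

print(sample)
```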

As you can see, these questions do not appear to have been planned in advance. Some answers and statements also seem to have been erroneously tagged as questions.
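Both impressions can be checked roughly in code. The sketch below is one possible approach, not a definitive analysis: it counts how often the same question recurs (frequent repeats would hint at a prepared script) and flags entries without a question mark as candidates for mis-tagged statements. The short list stands in for `Qlist` and is copied by hand from the outputs above. The function name `normalize` is chosen for this illustration.

```python
from collections import Counter

# A handful of entries copied from the outputs above, standing in for Qlist.
sample_questions = [
    "Please tell us your name, where you were born, and when you were born?",
    "Tell us about your family and your childhood before the war.",
    "At the border?",
    "Okay.",
    "Right.",
    "Right.",
    "You saw that?",
    "No. That's what I wondered.",
]


def normalize(question):
    """Lower-case and collapse whitespace so trivially different repeats match."""
    return " ".join(question.lower().split())


counts = Counter(normalize(q) for q in sample_questions)
repeated = [(q, n) for q, n in counts.most_common() if n > 1]

# Entries with no question mark are only candidates for mis-tagged text:
# prompts such as "Tell us about..." are caught by this rough heuristic too.
no_question_mark = [q for q in sample_questions if "?" not in q]

print(repeated)
print(len(no_question_mark), "of", len(sample_questions), "entries contain no question mark")
```

On the full list, a long tail of questions that occur only once would support the impression that interviewers mostly improvised.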