# XML: Adding structure to unstructured text

Traditionally, researchers added XML tags to unstructured texts in two main ways:

1. Manually, by inserting the tags by hand, often with a dedicated XML editor such as Oxygen XML Editor.

2. Programmatically, often with *regular expressions*. Regular expressions, or regexes, let you identify different text components by searching for specific text patterns. For example, you could identify footnotes by searching near the bottom of a page for lines that begin with a number followed by a period and a space. The regex for this would be `"^[0-9]+\.\s\w.*$"`, which breaks down as follows:

    + `^` anchors the match at the start of the line
    + `[0-9]+` matches any digit 0-9 appearing one or more times (`+`)
    + `\.` matches a literal period (escaped with a backslash because an unescaped period is a wildcard character in regex)
    + `\s` matches a single whitespace character, such as a space
    + `\w` matches a single word character (a letter, digit, or underscore)
    + `.*` matches any character (`.`) zero or more times (`*`), up to the end of the line (`$`)
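
As a minimal sketch of how this pattern behaves in Python, the example below applies it with the standard `re` module. The sample page text is invented for illustration, and wrapping each match in a `<note>` tag is only one plausible tagging choice, not a required one. The `re.MULTILINE` flag is needed so that `^` and `$` match at every line boundary rather than only at the start and end of the whole string.

```python
import re

# The footnote pattern from the text. re.MULTILINE makes ^ and $ apply to each line.
FOOTNOTE = re.compile(r"^[0-9]+\.\s\w.*$", re.MULTILINE)

# Invented sample text standing in for the bottom of a digitized page.
page = """The treaty was signed in 1648.
It ended decades of war.
1. See the appendix for the full text.
12. Dates follow the Gregorian calendar."""

# Collect every line that looks like a footnote.
footnotes = FOOTNOTE.findall(page)

# Wrap each matched line in an illustrative <note> tag.
tagged = FOOTNOTE.sub(lambda m: f"<note>{m.group(0)}</note>", page)

print(footnotes)
print(tagged)
```

Note that the first line ends with a number and a period but is not matched, because the pattern requires the number to come at the start of the line.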

Generative AI tools can now do much of this tagging for us.

In this notebook, we will import an XML file that was created by converting a screenplay to XML with the GPT-4o Mini model in Dartmouth Chat.




### Exercise 1. 

Develop your own prompt, or revise and improve the one below. Then upload your digitized script to chat.dartmouth.edu and run the prompt.

``` 
Here is the full transcript of __________________________
Without adding or changing a single word in this transcript, add xml following TEI guidelines for performance texts. Among other things, this TEI-encoded text should identify different elements of front matter (i.e. <title>, <prologue>, <set>, and <castList> tags), the body text (<div> tags identifying each new scene, <head> tags for the title of each scene, <sp> tags to identify dialogue by each speaker with <speaker> tags tagging each speaker’s name, <l> to identify each line break, <stage> tags to identify stage directions, etc.), and back matter (i.e. <epilogue>). The <sp> tags should identify the name of each speaker, which will help if a particular speaker is implied but not named.

```
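
Because the prompt asks the model not to add or change a single word, it can be worth checking the result before working with it. The sketch below is one possible check, not part of the original exercise: the function name `check_tagged_output` and the two-line sample are hypothetical, and it uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs. It reports whether the output is well-formed XML and whether its words, ignoring whitespace differences, match the original transcript.

```python
import xml.etree.ElementTree as ET

def check_tagged_output(original_text, tagged_xml):
    """Return (is_well_formed, words_preserved) for a model-tagged transcript.

    Hypothetical helper: compares whitespace-separated words only, so
    differences in line breaks or indentation are ignored.
    """
    try:
        root = ET.fromstring(tagged_xml)
    except ET.ParseError:
        # The output is not well-formed XML, so its text cannot be compared.
        return False, False
    # Join the text of every element, then compare word by word.
    tagged_text = " ".join(root.itertext())
    return True, original_text.split() == tagged_text.split()

# Invented two-line sample, not taken from any real script.
original = "SPEAKER A\nHello there.\n(She sits.)"
tagged = (
    "<sp><speaker>SPEAKER A</speaker>"
    "<l>Hello there.</l><stage>(She sits.)</stage></sp>"
)

result_ok = check_tagged_output(original, tagged)
result_changed = check_tagged_output(original, tagged.replace("Hello", "Hi"))
result_broken = check_tagged_output(original, "<sp><l>Hello there.</sp>")
print(result_ok, result_changed, result_broken)
```

A word-level comparison like this will also flag text that is meant to differ, such as a speaker name the model supplied inside an `<sp>` element, so treat a mismatch as a prompt to look closer rather than as proof of an error.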

## 1. Working with XML text

In [None]:
# Uncomment the lines below to install the required packages if they are missing
# %pip install beautifulsoup4
# %pip install lxml

In [None]:
from bs4 import BeautifulSoup
from lxml import etree

In [None]:
# Read the whole XML file into a single string
with open("../../texts/sopranos-partial.xml", encoding='utf-8') as f:
    txt = f.read()

In [None]:
# Parse with the "xml" parser (provided by lxml) so that tag names keep their case
soup = BeautifulSoup(txt, features="xml")
print(soup)

<?xml version="1.0" encoding="utf-8"?>
<TEI xml:lang="en" xmlns="http://www.tei-c.org/ns/1.0" xmlns:hist="http://personal-website-here.com">
<text>
<front>
<title>The Sopranos - Pilot Episode</title>
<set>Locations</set>
<castList>
<castMember>THOMAS SOPRANO</castMember>
<castMember>DR. JENNIFER MELFI</castMember>
<castMember>CARMELA SOPRANO</castMember>
<castMember>MEADOW SOPRANO</castMember>
<castMember>TOMMY JR.</castMember>
<castMember>CHRIS MOLTISANTI</castMember>
<castMember>BIG PUSSY BONPENSIERO</castMember>
<castMember>PAT "PAULIE WALNUTS" GUALTIERI</castMember>
<castMember>DICK BARONE</castMember>
<castMember>ARTIE BUCCO</castMember>
<castMember>CHARMAINE</castMember>
<castMember>LIVIA SOPRANO</castMember>
<castMember>FATHER PHIL</castMember>
<castMember>SILVIO DANTE</castMember>
<castMember>HERMAN "HESH" RABKIN</castMember>
</castList>
</front>
<body>
<div>
<head>INT. DR. MELFI'S WAITING ROOM - DAY</head>
<stage>THOMAS SOPRANO, 40, sits and waits. Uneasily. Staring confusedly