
# Python XML Tutorial: Beginner's Guide

Adapted from the original posts [here](https://www.datacamp.com/tutorial/python-xml-elementtree) and [here](http://www2.hawaii.edu/~takebaya/cent110/xml_parse/xml_parse.html)

Modified by Dr. Jie Tao




As a data scientist, you'll find that understanding XML is powerful for both web-scraping and general practice in parsing a structured document.

In this tutorial, you'll cover the following topics:

- You'll learn more about [XML](https://colab.research.google.com/drive/1jkZGfFy1YZtfbMkW4hipFwRLIRDNsbhg?authuser=2#scrollTo=sf0NTyCJ3X4Y).
- You'll learn how to do simple XML parsing [using `BeautifulSoup`](https://colab.research.google.com/drive/1jkZGfFy1YZtfbMkW4hipFwRLIRDNsbhg?authuser=2#scrollTo=k3qCw3LM7sru), and you'll get introduced to the [Python `ElementTree` package](https://colab.research.google.com/drive/1jkZGfFy1YZtfbMkW4hipFwRLIRDNsbhg?authuser=2#scrollTo=Z9nGfj279w_B).
- Then, you'll discover how you can [explore XML trees](https://colab.research.google.com/drive/1jkZGfFy1YZtfbMkW4hipFwRLIRDNsbhg?authuser=2#scrollTo=Exploring_the_XML_Document) to understand the data that you're working with better with the help of ElementTree functions, for loops and XPath expressions.
- Next, you'll learn how you can [find and modify an XML file](https://colab.research.google.com/drive/1jkZGfFy1YZtfbMkW4hipFwRLIRDNsbhg?authuser=2#scrollTo=Find_and_Modify_Data_in_XML).


## What is XML?
XML stands for "Extensible Markup Language". It is widely used to store and exchange data that has a specific structure, for example on the web, in a form that both people and programs can read.

XML creates a __tree-like__ structure that is easy to interpret and supports a __hierarchy__ (similar to the HTML tree we saw before). Whenever a page follows XML, it can be called an __XML document__.

XML documents have sections, called elements, defined by a beginning and an ending **tag** (similar to HTML tags). A tag is a markup construct that begins with `<` and ends with `>`. The characters between the start-tag (e.g. `<movie>`) and end-tag (e.g. `</movie>`), if there are any, are the element's **content**. Elements can contain markup, including other elements, which are called _child elements_.
The largest, top-level element is called the **root**, which contains all other elements.
Attributes are **name–value pairs** that exist within a start-tag or empty-element tag. An XML attribute can only have a **single** value and each attribute can appear at most once on each element.

An example XML document would look like below:
```xml
<?xml version="1.0" ?>
<books>
  <book>
    <title>The Cat in the Hat</title>
    <author>Dr. Seuss</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Ender's Game</title>
    <author>Orson Scott Card</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Prey</title>
    <author>Michael Crichton</author>
    <price>9.35</price>
  </book>
</books>
```
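
To make the terms above concrete, here is a minimal sketch that parses a shortened copy of this document with Python's built-in `xml.etree.ElementTree` module (introduced later in this tutorial). The `id` attribute and the name `books_root` are assumptions added only for illustration; they are not part of the example above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above; the id attribute is hypothetical
books_xml = """<?xml version="1.0" ?>
<books>
  <book id="b1">
    <title>The Cat in the Hat</title>
    <author>Dr. Seuss</author>
    <price>6.99</price>
  </book>
  <book id="b2">
    <title>Prey</title>
    <author>Michael Crichton</author>
    <price>9.35</price>
  </book>
</books>"""

books_root = ET.fromstring(books_xml)    # the root element, <books>
print(books_root.tag)

for book in books_root:                  # the child elements, <book>
    title = book.find("title").text      # the content of <title>
    print(book.attrib, title)            # attributes are name-value pairs
```

Each `<book>` is a child of the root, `title` holds element content, and `id` is an attribute stored in the start-tag.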


Now we can use what we learned about _web scraping_ to retrieve an XML document online (with `requests`) and then do some simple parsing with it (with `BeautifulSoup`).

## Read XML with `BeautifulSoup`

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
xml_url = "https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml"
response = requests.get(xml_url)
soup = BeautifulSoup(response.text,'xml')
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<collection>
 <genre category="Action">
  <decade years="1980s">
   <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
    <format multiple="No">
     DVD
    </format>
    <year>
     1981
    </year>
    <rating>
     PG
    </rating>
    <description>
     'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
    </description>
   </movie>
   <movie favorite="True" title="THE KARATE KID">
    <format multiple="Yes">
     DVD,Online
    </format>
    <year>
     1984
    </year>
    <rating>
     PG
    </rating>
    <description>
     None provided.
    </description>
   </movie>
   <movie favorite="False" title="Back 2 the Future">
    <format multiple="False">
     Blu-ray
    </format>
    <year>
     1985
    </year>
    <rating>
     PG
    </rating>
    <description>
     Marty McFly
    </description>
   </mo

From what you have read above, you see that

- `<collection>` is the single root element: it contains all the other elements, such as `<genre>`, or `<movie>`, which are the child elements or subelements. As you can see, these elements are nested.

**Note** that these child elements can also act as parents and contain their own child elements, which are then called "sub-child elements".

- You'll see that, for example, the `<movie>` element contains a couple of "attributes", such as `favorite` or `title` that give even more information!

We can use `BeautifulSoup` to retrieve some simple information, just like what we did with web scraping.

For instance if we want all the `titles` of `movies`:
- `<movie>` are tags, so we can use `find_all()`;
- `title` is an attribute in each `<movie>`, so the `get()` method can be used.

In [None]:
titles = [movie.get("title") for movie in soup.find_all("movie")]
titles

['Indiana Jones: The raiders of the lost Ark',
 'THE KARATE KID',
 'Back 2 the Future',
 'X-Men',
 'Batman Returns',
 'Reservoir Dogs',
 'ALIEN',
 "Ferris Bueller's Day Off",
 'American Psycho',
 'Batman: The Movie',
 'Easy A',
 'Dinner for SCHMUCKS',
 'Ghostbusters',
 'Robin Hood: Prince of Thieves']

### DO IT YOURSELF

How about getting the `year` of each `<movie>`?

In [None]:
#### Must print each movie's release year in a list
years = [y.text for y in soup.find_all('year')]  # get the text of every <year> tag

moviedict = dict(zip(titles, years))  # extra practice: combine the two lists into a dict

# soup.find_all('year') alone returns the tags themselves, which is not quite what we want

moviedict

{'Indiana Jones: The raiders of the lost Ark': '1981',
 'THE KARATE KID': '1984',
 'Back 2 the Future': '1985',
 'X-Men': '2000',
 'Batman Returns': '1992',
 'Reservoir Dogs': '1992',
 'ALIEN': '1979',
 "Ferris Bueller's Day Off": '1986',
 'American Psycho': '2000',
 'Batman: The Movie': '1966',
 'Easy A': '2010',
 'Dinner for SCHMUCKS': '2011',
 'Ghostbusters': '1984',
 'Robin Hood: Prince of Thieves': '1991'}

In [None]:
#@title Solution
years = [movie.year.text for movie in soup.find_all("movie")]
years

['1981',
 '1984',
 '1985',
 '2000',
 '1992',
 '1992',
 '1979',
 '1986',
 '2000',
 '1966',
 '2010',
 '2011',
 '1984',
 '1991']

Now we can create a dict of movies and the respective release years.

In [None]:
#### use try-except to control if we missed titles or years
try:
  assert len(titles) == len(years)
  movie_years = dict(zip(titles, years))
except AssertionError:
  print("There are {} movie titles but {} year values".format(len(titles), len(years)))
  raise ValueError("titles and years have different length!")
movie_years

{'Indiana Jones: The raiders of the lost Ark': '1981',
 'THE KARATE KID': '1984',
 'Back 2 the Future': '1985',
 'X-Men': '2000',
 'Batman Returns': '1992',
 'Reservoir Dogs': '1992',
 'ALIEN': '1979',
 "Ferris Bueller's Day Off": '1986',
 'American Psycho': '2000',
 'Batman: The Movie': '1966',
 'Easy A': '2010',
 'Dinner for SCHMUCKS': '2011',
 'Ghostbusters': '1984',
 'Robin Hood: Prince of Thieves': '1991'}

We can see that `BeautifulSoup` is capable of capturing the information we requested. However, when the logic gets more complicated, `BeautifulSoup` becomes harder to work with. That's why we turn to the native `ElementTree` package.

## Read XML using `ElementTree`

In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python - then fix the problems.


In [None]:
import xml.etree.ElementTree as ET

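It will be easier to download the XML document to a local file. Note that `wget` treats the trailing `./movies.xml` as a second URL, which is why the log below ends with an unresolved-host error; the file itself is still saved as `movies.xml`.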

In [None]:
!wget "https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml" ./movies.xml

--2022-10-04 01:43:50--  https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5025 (4.9K) [text/plain]
Saving to: ‘movies.xml’


2022-10-04 01:43:50 (40.3 MB/s) - ‘movies.xml’ saved [5025/5025]

--2022-10-04 01:43:50--  http://./movies.xml
Resolving . (.)... failed: No address associated with hostname.
wget: unable to resolve host address ‘.’
FINISHED --2022-10-04 01:43:50--
Total wall clock time: 0.2s
Downloaded: 1 files, 4.9K in 0s (40.3 MB/s)


In [None]:
tree = ET.parse("./movies.xml")
root = tree.getroot()
root

ParseError: ignored

The call above raised a `ParseError` in this run. Alternatively, you can parse the content of the `response` object we created before, using the `fromstring()` function with an explicit parser encoding.

In [None]:
# from xml.etree.ElementTree import fromstring, ElementTree
tree = ET.ElementTree(ET.fromstring(response.content, parser = ET.XMLParser(encoding = 'iso-8859-5')))
root = tree.getroot()
root

<Element 'collection' at 0x7fbcb8814170>

We can investigate more on `root`.

In [None]:
root.tag

'collection'

In [None]:
root.attrib

{}

In [None]:
root.text

'\n    '

### Exploring the XML Document

If you do not know the specific tag you want to search for, it never hurts to look at the _children_ of `root`.

In [None]:
for child in root:
    print(child.tag, child.attrib)

genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}


Now you know that the children of the __root__ `collection` are all `genre`. To designate the `genre`, the XML uses the **attribute** `category`. There are `Action`, `Thriller`, and `Comedy` movies according to the `genre` element.

Typically it is helpful to know all the elements in the **entire tree**. One useful function for doing that is `root.iter()`. You can put this function into a `for` loop and it will iterate over the entire tree.

__NOTE__:
1. When you use `root.iter()`, it **flattens** the tree structure - so no more children and grandchildren;
2. You should consider doing this with a list/dict comprehension.



In [None]:
[elem.tag for elem in root.iter()]

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description']

This gives a general notion for how many elements you have, but it does not show the **attributes** or **levels** in the tree.
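
If you also want the levels and attributes, one option is a small recursive walk. The following is only a sketch: the helper name `outline` and the stand-in document are hypothetical, and the same function could be applied to the real `root`.

```python
import xml.etree.ElementTree as ET

# Small stand-in document (hypothetical, shaped like movies.xml)
sample = """<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie title="THE KARATE KID" favorite="True">
        <year>1984</year>
      </movie>
    </decade>
  </genre>
</collection>"""

def outline(elem, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + elem.tag + " " + str(elem.attrib)]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

sample_root = ET.fromstring(sample)
print("\n".join(outline(sample_root)))
```

Unlike `root.iter()`, this keeps the nesting visible because each level is indented two more spaces than its parent.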

There is a helpful way to see the whole document. The `ElementTree` module provides a `tostring()` function. If you pass the root into `ET.tostring()`, you get the whole document back. The call takes a slightly unusual form.

When you pass an encoding such as `'utf8'`, `ET.tostring()` returns bytes rather than a string, so you decode the result before printing it. For XML, use `'utf8'` - this is the typical encoding standard for an XML.

__NOTE__: for other languages, other encoding standards can be used. Refer to the web scraping lesson for more details on this.
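
As a side note, the difference between bytes and text output can be checked on a tiny element. This is a minimal sketch; the element and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<movie title='X-Men'><year>2000</year></movie>")

as_bytes = ET.tostring(elem, encoding="utf8")     # bytes, with an XML declaration
as_text = ET.tostring(elem, encoding="unicode")   # str, without a declaration

print(type(as_bytes), type(as_text))
print(as_text)
```

Passing `encoding="unicode"` returns a string directly, so no `.decode()` step is needed in that case.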

In [None]:
print(ET.tostring(root, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
  

In [None]:
#More practice this time with ratings

ratings = [rating.text for rating in soup.find_all('rating')]

dict(zip(titles, ratings))

{'Indiana Jones: The raiders of the lost Ark': 'PG',
 'THE KARATE KID': 'PG',
 'Back 2 the Future': 'PG',
 'X-Men': 'PG-13',
 'Batman Returns': 'PG13',
 'Reservoir Dogs': 'R',
 'ALIEN': 'R',
 "Ferris Bueller's Day Off": 'PG13',
 'American Psycho': 'Unrated',
 'Batman: The Movie': 'PG',
 'Easy A': 'PG--13',
 'Dinner for SCHMUCKS': 'Unrated',
 'Ghostbusters': 'PG',
 'Robin Hood: Prince of Thieves': 'Unknown'}

### Investigating Specific Elements

You can expand the use of the `iter()` function to help with finding particular elements of interest. `root.iter()` will list all subelements under the root that match the element specified. Here, you will list all attributes of the `movie` element in the tree:

In [None]:
#### put the tag ("movie") in root.iter()
#### .attrib gives you all the attributes for a given tag
all_movies = [movie.attrib for movie in root.iter("movie")]
all_movies

[{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'},
 {'favorite': 'True', 'title': 'THE KARATE KID'},
 {'favorite': 'False', 'title': 'Back 2 the Future'},
 {'favorite': 'False', 'title': 'X-Men'},
 {'favorite': 'True', 'title': 'Batman Returns'},
 {'favorite': 'False', 'title': 'Reservoir Dogs'},
 {'favorite': 'False', 'title': 'ALIEN'},
 {'favorite': 'True', 'title': "Ferris Bueller's Day Off"},
 {'favorite': 'FALSE', 'title': 'American Psycho'},
 {'favorite': 'False', 'title': 'Batman: The Movie'},
 {'favorite': 'True', 'title': 'Easy A'},
 {'favorite': 'True', 'title': 'Dinner for SCHMUCKS'},
 {'favorite': 'False', 'title': 'Ghostbusters'},
 {'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}]

Many times elements will not have attributes; they will only have text content. Using the attribute `.text`, you can retrieve it.

For instance, you can get all the descriptions of the movies:

In [None]:
all_descs = [description.text for description in root.iter('description')]
all_descs

["\n                'Archaeologist and adventurer Indiana Jones \n                is hired by the U.S. government to find the Ark of the \n                Covenant before the Nazis.'\n                ",
 'None provided.',
 'Marty McFly',
 'Two mutants come to a private academy for their kind whose resident superhero team must \n               oppose a terrorist organization with similar powers.',
 'NA.',
 'WhAtEvER I Want!!!?!',
 '"""""""""',
 'Funny movie about a funny guy',
 'psychopathic Bateman',
 'What a joke!',
 'Emma Stone = Hester Prynne',
 'Tim (Rudd) is a rising executive\n                 who т\x80\x9csucceedsт\x80\x9d in finding the perfect guest, \n                 IRS employee Barry (Carell), for his bossт\x80\x99 monthly event, \n                 a so-called т\x80\x9cdinner for idiots,т\x80\x9d which offers certain \n                 advantages to the exec who shows up with the biggest buffoon.\n                 ',
 'Who ya gonna call?',
 'Robin Hood slaying']
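
Some of these strings carry the newlines and indentation of the source file. A minimal sketch of one way to tidy them is shown here; the helper name `clean_text` and the snippet are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet with the same kind of padded text as movies.xml
snippet = """<movie>
  <description>
      'Archaeologist and adventurer Indiana Jones
      is hired by the U.S. government.'
  </description>
</movie>"""

def clean_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join((text or "").split())

desc = ET.fromstring(snippet).find("description")
print(clean_text(desc.text))
```

The `(text or "")` guard covers elements with no text at all, for which `.text` is `None`.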

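We can even add the descriptions from `all_descs` to `all_movies`. Because each entry of `all_movies` is the element's own `.attrib` dict, this also adds a `description` attribute to the `<movie>` elements in the tree, which is why it shows up in later outputs.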

In [None]:
try:
  assert len(all_descs) == len(all_movies)
  for i, desc in enumerate(all_descs):
    all_movies[i]["description"] = desc
except AssertionError:
  # raise(ValueError())
  print("{} descriptions for {} movies, mismatch".format(len(all_descs), len(all_movies)))
all_movies

[{'description': "\n                'Archaeologist and adventurer Indiana Jones \n                is hired by the U.S. government to find the Ark of the \n                Covenant before the Nazis.'\n                ",
  'favorite': 'True',
  'title': 'Indiana Jones: The raiders of the lost Ark'},
 {'description': 'None provided.',
  'favorite': 'True',
  'title': 'THE KARATE KID'},
 {'description': 'Marty McFly',
  'favorite': 'False',
  'title': 'Back 2 the Future'},
 {'description': 'Two mutants come to a private academy for their kind whose resident superhero team must \n               oppose a terrorist organization with similar powers.',
  'favorite': 'False',
  'title': 'X-Men'},
 {'description': 'NA.', 'favorite': 'True', 'title': 'Batman Returns'},
 {'description': 'WhAtEvER I Want!!!?!',
  'favorite': 'False',
  'title': 'Reservoir Dogs'},
 {'description': '"""""""""', 'favorite': 'False', 'title': 'ALIEN'},
 {'description': 'Funny movie about a funny guy',
  'favorite': 'Tru

### Search with XPath

The steps above work, but XPath is a query language built to search through an XML document quickly and easily. XPath stands for XML Path Language and uses, as the name suggests, a "path like" syntax to identify and navigate nodes in an XML document.

Understanding XPath is critically important to scanning and populating XMLs. ElementTree has a `.findall()` function that will traverse the immediate children of the referenced element. You can use XPath expressions to specify more useful searches.

Here, you will search the tree for movies that came out in 1992:

In [None]:
movies_1992 = [movie.attrib for movie in root.findall("./genre/decade/movie/[year='1992']")]
movies_1992

[{'description': 'NA.', 'favorite': 'True', 'title': 'Batman Returns'},
 {'description': 'WhAtEvER I Want!!!?!',
  'favorite': 'False',
  'title': 'Reservoir Dogs'}]

The method `.findall()` always begins at the element specified.

The XPath `./genre/decade/movie/[year='1992']` works as follows:
- First, it goes through the tree structure `root(.) -> <genre> -> <decade> -> <movie>`.
- Then, it checks whether a child of `<movie>`, `<year>`, has the text `1992`.
- If it matches, the `<movie>` element is returned, and the list comprehension collects its _attributes_.

__NOTE__: `.findall()`, like `.find_all()` in `BeautifulSoup`, always returns a list. So you may need some kind of loop to go through it.
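
The same kind of query can be tried on a tiny document. This is a sketch under stated assumptions: the miniature XML and the name `mini_root` are hypothetical, and the predicate is written without the extra slash (`movie[year='1992']`), which is the form shown in the `ElementTree` documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document shaped like movies.xml
mini = """<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie title="Batman Returns"><year>1992</year></movie>
      <movie title="Reservoir Dogs"><year>1992</year></movie>
    </decade>
    <decade years="2000s">
      <movie title="X-Men"><year>2000</year></movie>
    </decade>
  </genre>
</collection>"""

mini_root = ET.fromstring(mini)

# findall() returns a list with every matching <movie> element
hits = mini_root.findall("./genre/decade/movie[year='1992']")
print([m.get("title") for m in hits])

# When nothing matches, the result is an empty list rather than an error
none_found = mini_root.findall("./genre/decade/movie[year='1950']")
print(none_found)
```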

Besides searching on elements, you can even search on attributes!

Now, retrieve only the movies that are available in `multiple` formats (an attribute).

In [None]:
mf_movies = [movie.attrib for movie in root.findall("./genre/decade/movie/format/[@multiple='Yes']")]
mf_movies

[{'multiple': 'Yes'},
 {'multiple': 'Yes'},
 {'multiple': 'Yes'},
 {'multiple': 'Yes'},
 {'multiple': 'Yes'}]

Looks good, but it only returned the attributes of the matching `<format>` elements (`multiple`, with the value `Yes`). What if we want the movie titles instead?

We can add `/..` at the end of the XPath to tell it to go to the parent element of the current element.

In [None]:
mf_movies1 = [movie.attrib for movie in root.findall("./genre/decade/movie/format/[@multiple='Yes']/..")]
mf_movies1

[{'description': 'None provided.',
  'favorite': 'True',
  'title': 'THE KARATE KID'},
 {'description': 'Two mutants come to a private academy for their kind whose resident superhero team must \n               oppose a terrorist organization with similar powers.',
  'favorite': 'False',
  'title': 'X-Men'},
 {'description': '"""""""""', 'favorite': 'False', 'title': 'ALIEN'},
 {'description': 'What a joke!',
  'favorite': 'False',
  'title': 'Batman: The Movie'},
 {'description': 'Tim (Rudd) is a rising executive\n                 who т\x80\x9csucceedsт\x80\x9d in finding the perfect guest, \n                 IRS employee Barry (Carell), for his bossт\x80\x99 monthly event, \n                 a so-called т\x80\x9cdinner for idiots,т\x80\x9d which offers certain \n                 advantages to the exec who shows up with the biggest buffoon.\n                 ',
  'favorite': 'True',
  'title': 'Dinner for SCHMUCKS'}]
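
Stepping from a matched child back up to its parent can be checked in isolation. The sketch below uses a made-up two-movie document and a hypothetical variable `mini_root`.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-movie document
mini = """<collection>
  <movie title="THE KARATE KID"><format multiple="Yes">DVD,Online</format></movie>
  <movie title="Batman Returns"><format multiple="No">VHS</format></movie>
</collection>"""

mini_root = ET.fromstring(mini)

# Select the matching <format> elements, then step up to their parent <movie>
parents = mini_root.findall("./movie/format[@multiple='Yes']/..")
print([m.get("title") for m in parents])
```

The result contains `<movie>` elements, so their `title` attribute is available again.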

## Find (and Modify) Data in XML

### Find (and Modify) a Specific Element in XML

Instead of finding all elements that fit certain criteria (like we did above), we can also search for a specific element using `.find()`.

For instance, if we only want "Back to the Future" (a classic!!!), we can:

In [None]:
b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)

<Element 'movie' at 0x7f45079a0830>


Notice that using the `.find()` method returns an element of the tree. Much of the time, it is more useful to **edit** the content within an element.

__WARNING__: It is NOT the best practice to edit your source data, only do this if you are 1,000% sure, and make copies of your original data!
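
One way to follow this advice in memory is to edit a deep copy of the parsed element and keep the original untouched. This is a minimal sketch; the one-movie element and the names `original` and `working` are assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

original = ET.fromstring("<movie title='Back 2 the Future'><year>1985</year></movie>")

# Work on a deep copy so the parsed original stays untouched
working = copy.deepcopy(original)
working.set("title", "Back to the Future")

print(original.get("title"))
print(working.get("title"))
```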

Modify the title attribute of the `Back 2 the Future` element variable to read `"Back to the Future"`. Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of an element and then assigning a new value to it:

In [None]:
b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)

{'favorite': 'False', 'title': 'Back to the Future', 'description': 'Marty McFly'}


Write your changes back to an XML file so they are saved permanently. Then print out your movie attributes again to make sure your changes worked. Use the `.write()` method to do this:

In [None]:
tree.write("movies_new.xml")

tree = ET.parse('movies_new.xml')
root = tree.getroot()

for movie in root.iter('movie'):
    print(movie.attrib)

{'description': "\n                'Archaeologist and adventurer Indiana Jones \n                is hired by the U.S. government to find the Ark of the \n                Covenant before the Nazis.'\n                ", 'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'description': 'None provided.', 'favorite': 'True', 'title': 'THE KARATE KID'}
{'description': 'Marty McFly', 'favorite': 'False', 'title': 'Back to the Future'}
{'description': 'Two mutants come to a private academy for their kind whose resident superhero team must \n               oppose a terrorist organization with similar powers.', 'favorite': 'False', 'title': 'X-Men'}
{'description': 'NA.', 'favorite': 'True', 'title': 'Batman Returns'}
{'description': 'WhAtEvER I Want!!!?!', 'favorite': 'False', 'title': 'Reservoir Dogs'}
{'description': '"""""""""', 'favorite': 'False', 'title': 'ALIEN'}
{'description': 'Funny movie about a funny guy', 'favorite': 'True', 'title': "Ferris Bueller's Day O
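
If you want the saved file to state its encoding explicitly, `.write()` accepts `encoding` and `xml_declaration` arguments. The sketch below writes a made-up one-movie tree to a temporary directory; the file name `movies_copy.xml` and the variable names are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

doc = ET.ElementTree(
    ET.fromstring("<collection><movie title='Back to the Future'/></collection>")
)

# Hypothetical output path inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "movies_copy.xml")

# Write UTF-8 with an explicit XML declaration at the top of the file
doc.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the file back to confirm the change was saved
reloaded = ET.parse(out_path).getroot()
print(reloaded.find("movie").get("title"))
```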

### Find (and Modify) an Attribute in XML

The `multiple` attribute is incorrect in some places. Use `ElementTree` to fix the designator based on how many formats the movie comes in. First, extract the `format` attribute and text to see which parts need to be fixed.

In [None]:
formats = [(form.attrib, form.text) for form in root.findall("./genre/decade/movie/format")]
formats

[({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'Yes'}, 'DVD,Online'),
 ({'multiple': 'False'}, 'Blu-ray'),
 ({'multiple': 'Yes'}, 'dvd, digital'),
 ({'multiple': 'No'}, 'VHS'),
 ({'multiple': 'No'}, 'Online'),
 ({'multiple': 'Yes'}, 'DVD'),
 ({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'No'}, 'blue-ray'),
 ({'multiple': 'Yes'}, 'DVD,VHS'),
 ({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'Yes'}, 'DVD,digital,Netflix'),
 ({'multiple': 'No'}, 'Online,VHS'),
 ({'multiple': 'No'}, 'Blu_Ray')]

There is some work that needs to be done on this tag.

You can simply test whether the `.text` contains a comma - that tells you whether the `multiple` attribute should be "Yes" or "No". Adding and modifying attributes can be done easily with the `.set()` method.

Note: we can also use `re`, Python's standard regular expression module. If you want to know more about regular expressions, consider [this tutorial](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial).

In [None]:
for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text

    if ',' in form.text:
        form.set('multiple','Yes')
    else:
        form.set('multiple','No')

formats = [(form.attrib, form.text) for form in root.findall("./genre/decade/movie/format")]
formats

[({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'Yes'}, 'DVD,Online'),
 ({'multiple': 'No'}, 'Blu-ray'),
 ({'multiple': 'Yes'}, 'dvd, digital'),
 ({'multiple': 'No'}, 'VHS'),
 ({'multiple': 'No'}, 'Online'),
 ({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'No'}, 'blue-ray'),
 ({'multiple': 'Yes'}, 'DVD,VHS'),
 ({'multiple': 'No'}, 'DVD'),
 ({'multiple': 'Yes'}, 'DVD,digital,Netflix'),
 ({'multiple': 'Yes'}, 'Online,VHS'),
 ({'multiple': 'No'}, 'Blu_Ray')]

Just in case you are curious about using regex (`re`):

```python
import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',',form.text)
    if match:
        form.set('multiple','Yes')
    else:
        form.set('multiple','No')

```
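
Going one step further, `re.split()` can turn the format text into a list, which also copes with entries such as `dvd, digital` that have a space after the comma. This is a sketch on a single made-up `<format>` element, not part of the original walkthrough; the names `fmt` and `parts` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

fmt = ET.fromstring('<format multiple="No">dvd, digital</format>')

# Split on commas, ignoring any spaces around them
parts = [p for p in re.split(r"\s*,\s*", fmt.text.strip()) if p]

# More than one part means the movie comes in multiple formats
fmt.set("multiple", "Yes" if len(parts) > 1 else "No")

print(parts, fmt.attrib)
```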

## Conclusion
There are some key things to remember about XMLs and using `ElementTree`.

Tags build the tree structure and designate what values should be delineated there. Using smart structuring can make it easy to read and write to an XML. Elements always need an opening and a closing tag to show the parent and children relationships.

Attributes further describe how to validate a tag or allow for boolean designations. Attributes typically take very specific values so that the XML parser (and the user) can use the attributes to check the tag values.

`ElementTree` is an important Python library that allows you to parse and navigate an XML document. It breaks the XML document down into a tree structure that is easy to work with. When in doubt, print it out with `print(ET.tostring(root, encoding='utf8').decode('utf8'))` - use this helpful print statement to view the entire XML document at once. It helps to check when editing, adding, or removing from an XML.

Now you are equipped to understand XML and begin parsing!

# CODING ASSIGNMENT PART 1B

Scrape all the movie information from the `movies.xml` file, and store them in a list of dicts (name the list `movie_details`).

__NOTES/HINTS__:
1. A total of `14` movies are in this file, so your list should contain `14` elements.
2. Each element contains the details about a movie, which should contain the keys (and respective values) below (not in that particular order):
  + favorite
  + title
  + format
  + multiple
  + year
  + description
  + rating
3. Some of the values of the dict can be either strings or lists.

In [None]:
import xml.etree.ElementTree as ET

import requests


xml_url = "https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml"
response = requests.get(xml_url)


!wget "https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml" ./movies.xml





--2022-10-15 16:16:19--  https://raw.githubusercontent.com/DrJieTao/ba505-docs/master/data/movies.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5025 (4.9K) [text/plain]
Saving to: ‘movies.xml’


2022-10-15 16:16:19 (45.0 MB/s) - ‘movies.xml’ saved [5025/5025]

--2022-10-15 16:16:19--  http://./movies.xml
Resolving . (.)... failed: No address associated with hostname.
wget: unable to resolve host address ‘.’
FINISHED --2022-10-15 16:16:19--
Total wall clock time: 0.2s
Downloaded: 1 files, 4.9K in 0s (45.0 MB/s)


In [None]:
tree = ET.ElementTree(ET.fromstring(response.content, parser = ET.XMLParser(encoding = 'iso-8859-5')))
root = tree.getroot()
root

<Element 'collection' at 0x7f29644ce2f0>

In [None]:
elements = [elem.tag for elem in root.iter()]

elements

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description']

In [None]:
import pandas as pd

# list of dicts, one per movie
movie_details = []

# favorite and title are attributes of the <movie> tag itself,
# so they are read at the movie level
for movie in root.findall('./genre/decade/movie'):

  details = {'title': movie.attrib['title'],
             'favorite': movie.attrib['favorite'],
             'format': '', 'multiple': '', 'year': '', 'description': '', 'rating': ''}

  # the remaining details are children of the <movie> tag,
  # so iterate through the children of movie
  for child in movie:

    if child.tag == 'format':
      details['format'] = child.text
      details['multiple'] = child.attrib['multiple']

    elif child.tag in ('year', 'description', 'rating'):
      details[child.tag] = child.text

  movie_details.append(details)

# display the list of dicts as a dataframe of the movie details
df = pd.DataFrame(movie_details)

# start the row count from 1 instead of 0
df.index = df.index + 1

df




| | title | favorite | format | multiple | year | description | rating |
|---|---|---|---|---|---|---|---|
| 1 | Indiana Jones: The raiders of the lost Ark | True | DVD | No | 1981 | \n 'Archaeologist and adventure... | PG |
| 2 | THE KARATE KID | True | DVD,Online | Yes | 1984 | None provided. | PG |
| 3 | Back 2 the Future | False | Blu-ray | False | 1985 | Marty McFly | PG |
| 4 | X-Men | False | dvd, digital | Yes | 2000 | Two mutants come to a private academy for thei... | PG-13 |
| 5 | Batman Returns | True | VHS | No | 1992 | NA. | PG13 |
| 6 | Reservoir Dogs | False | Online | No | 1992 | WhAtEvER I Want!!!?! | R |
| 7 | ALIEN | False | DVD | Yes | 1979 | """"""""" | R |
| 8 | Ferris Bueller's Day Off | True | DVD | No | 1986 | Funny movie about a funny guy | PG13 |
| 9 | American Psycho | False | blue-ray | No | 2000 | psychopathic Bateman | Unrated |
| 10 | Batman: The Movie | False | DVD,VHS | Yes | 1966 | What a joke! | PG |


You can use this code block to test whether your code works correctly. If it runs without an error, you should be fine.

In [None]:
#### do not edit this code block
assert len(movie_details) == 14
all_keys = ['favorite', 'title', 'format', 'multiple', 'year', 'description', 'rating']
for i, movie_detail in enumerate(movie_details):
  try:
    assert set(movie_detail.keys()) == set(all_keys)
  except:
    if movie_detail['title']:
      print(movie_detail['title'], "missing information", set(all_keys) - set(movie_detail.keys()))
    else:
      print("The {}-th entry in movie_details missing information {}".format(i, set(all_keys) - set(movie_detail.keys())))