# Hierarchical Data

## Hierarchical Data

Consider a data set of TV shows.

Each show has...

- a name
- a network
- ...
- multiple seasons, each of which has...
    - a number
    - an airdate
    - ...
- a cast of multiple actors.

We can think of this data as a *tree*.

![a tree of TV shows](data/tree.png)

But how would we represent this data as a file on disk?

## JavaScript Object Notation (JSON)

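JSON is one way to represent hierarchical data. When JSON is loaded into Python, JSON objects become `dict`s and JSON arrays become `list`s, so a data set like ours is a `list` of nested `dict`s.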

In [None]:
[{'name': 'Girls',
    'network': {'name': 'HBO', ...},
    ...,
    'cast': [{'person': {'name': 'Lena Dunham', ...},
            'character': {'name': 'Hannah Horvath', ...},
            'voice': False},
            ...
            ],
 'seasons': [{'premiereDate': '2012-04-15',
            ...,
            'episodes': [{'name': 'Pilot',
                        'number': 1,
                        'runtime': 30,
                        ...},
                        ...
                    ]
                }
            ]
        }
     ]
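
To make the mapping between JSON and Python types concrete, here is a minimal sketch using the standard-library `json` module. The JSON string is a hypothetical one-show document written for illustration; it is not an excerpt from the actual data file.

```python
import json

# Hypothetical one-show JSON document (not taken from tvshows.json).
text = (
    '[{"name": "Girls", "runtime": 30, "voice": false, '
    '"webChannel": null, "genres": ["Drama", "Romance"]}]'
)

parsed = json.loads(text)

print(type(parsed))               # the outer JSON array becomes a list
print(type(parsed[0]))            # each JSON object becomes a dict
print(type(parsed[0]["genres"]))  # a nested JSON array is again a list
print(parsed[0]["voice"], parsed[0]["webChannel"])  # false -> False, null -> None
```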

## JSON

The JavaScript Object Notation, or **JSON**, data format is a popular way to represent hierarchical data. Despite its name, its application extends far beyond JavaScript, the language for which it was originally designed.

Let's read in a JSON file and print out the first observation. (_Warning:_ Never try to print the entire contents of a JSON file in a Jupyter notebook; this will freeze the notebook if the file is large!)

In [4]:
import requests
import json

# To download the data instead of reading a local copy, uncomment:
# response = requests.get("https://dlsun.github.io/pods/data/tvshows.json")
# with open("data/tvshows.json", "w") as f:
#     f.write(response.text)

# The parsed JSON can also be obtained directly from the response:
# data_shows = response.json()

# Read the local copy of the file
with open("data/tvshows.json", "r") as f:
    data_shows = json.load(f)

# Display the first observation
data_shows[0]

{'id': 139,
 'url': 'http://www.tvmaze.com/shows/139/girls',
 'name': 'Girls',
 'type': 'Scripted',
 'language': 'English',
 'genres': ['Drama', 'Romance'],
 'status': 'Ended',
 'runtime': 30,
 'premiered': '2012-04-15',
 'officialSite': 'http://www.hbo.com/girls',
 'schedule': {'time': '22:00', 'days': ['Sunday']},
 'rating': {'average': 6.9},
 'weight': 75,
 'network': {'id': 8,
  'name': 'HBO',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'}},
 'webChannel': None,
 'externals': {'tvrage': 30124, 'thetvdb': 220411, 'imdb': 'tt1723816'},
 'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/31/78286.jpg',
  'original': 'http://static.tvmaze.com/uploads/images/original_untouched/31/78286.jpg'},
 'summary': '<p>This Emmy winning series is a comic look at the assorted humiliations and rare triumphs of a group of girls in their 20s.</p>',
 'updated': 1577601053,
 'cast': [{'person': {'id': 27410,
    'url': 'http://www.tvm

Now let's investigate the JSON data that we just loaded, again being careful not to print out all of the data. Let's start by looking at the top-level variables associated with each TV show.

In [5]:
show = data_shows[0]  # data for the first TV show
show.keys()

dict_keys(['id', 'url', 'name', 'type', 'language', 'genres', 'status', 'runtime', 'premiered', 'officialSite', 'schedule', 'rating', 'weight', 'network', 'webChannel', 'externals', 'image', 'summary', 'updated', 'cast', 'seasons'])

We see variables like **`name`** and **`network`**, but also "variables" like **`cast`** and **`seasons`**, which contain multiple values.

In [6]:
show["name"]

'Girls'

In [7]:
show["cast"]

[{'person': {'id': 27410,
   'url': 'http://www.tvmaze.com/people/27410/lena-dunham',
   'name': 'Lena Dunham',
   'country': {'name': 'United States',
    'code': 'US',
    'timezone': 'America/New_York'},
   'birthday': '1986-05-13',
   'deathday': None,
   'gender': 'Female',
   'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/3/7597.jpg',
    'original': 'http://static.tvmaze.com/uploads/images/original_untouched/3/7597.jpg'}},
  'character': {'id': 36886,
   'url': 'http://www.tvmaze.com/characters/36886/girls-hannah-horvath',
   'name': 'Hannah Horvath',
   'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/0/1954.jpg',
    'original': 'http://static.tvmaze.com/uploads/images/original_untouched/0/1954.jpg'}},
  'self': False,
  'voice': False},
 {'person': {'id': 11102,
   'url': 'http://www.tvmaze.com/people/11102/allison-williams',
   'name': 'Allison Williams',
   'country': {'name': 'United States',
    'code': 'US',
    '

A "variable" (like **`cast`**) with multiple values is called a _repeated field_. A repeated field might itself contain a repeated field (e.g., each show has multiple seasons, and each season in turn has multiple episodes), creating a hierarchy of variables. Repeated fields are represented as arrays in JSON, which Python loads as lists.
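
As a rough illustration of a repeated field nested inside another repeated field, the sketch below counts episodes per show with two nested loops. The `toy_shows` data is hypothetical and only mimics the `seasons` and `episodes` layout sketched at the start of this section; the real file may carry different keys.

```python
# Hypothetical toy data: each show has a list of seasons,
# and each season has a list of episodes.
toy_shows = [
    {"name": "Show A",
     "seasons": [
         {"number": 1, "episodes": [{"name": "Pilot", "runtime": 30},
                                    {"name": "Second", "runtime": 28}]},
         {"number": 2, "episodes": [{"name": "Return", "runtime": 31}]},
     ]},
    {"name": "Show B",
     "seasons": [
         {"number": 1, "episodes": [{"name": "Start", "runtime": 60}]},
     ]},
]

# One loop per level of the hierarchy.
episode_counts = {}
for toy_show in toy_shows:
    total = 0
    for season in toy_show["seasons"]:
        total += len(season["episodes"])
    episode_counts[toy_show["name"]] = total

print(episode_counts)  # {'Show A': 3, 'Show B': 1}
```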

Let's take a closer look at how each cast member is represented, by zooming in on the first cast member.

In [8]:
show["cast"][0]

{'person': {'id': 27410,
  'url': 'http://www.tvmaze.com/people/27410/lena-dunham',
  'name': 'Lena Dunham',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'},
  'birthday': '1986-05-13',
  'deathday': None,
  'gender': 'Female',
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/3/7597.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/3/7597.jpg'}},
 'character': {'id': 36886,
  'url': 'http://www.tvmaze.com/characters/36886/girls-hannah-horvath',
  'name': 'Hannah Horvath',
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/0/1954.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/0/1954.jpg'}},
 'self': False,
 'voice': False}

It appears that each cast member is itself a dictionary with four keys: **`person`** (i.e., the actor), **`character`**, **`self`**, and **`voice`**. The first two attributes are themselves dictionaries containing further information about the actor and the character, while the last two attributes are booleans.

If we wanted to know which show had the biggest cast, we could first get the show name and actor name by traversing the levels using nested loops:

In [9]:
shows = []
actors = []
for show in data_shows:
    for castmember in show["cast"]:
        # exclude voice actors
        if not castmember["voice"]:
            shows.append(show["name"])
            actors.append(castmember["person"]["name"])

shows, actors

(['Girls',
  'Girls',
  'Girls',
  'Girls',
  'Girls',
  'Girls',
  'Girls',
  'Girls',
  'The Golden Girls',
  'The Golden Girls',
  'The Golden Girls',
  'The Golden Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Good Girls',
  'Florida Girls',
  'Florida Girls',
  'Florida Girls',
  'Florida Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Chicken Girls',
  'Derry Girls',
  'Derry Girls
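
To finish the "biggest cast" question from a list like `shows`, one option is to tally the show names with `collections.Counter`. This is only a sketch: `shows_toy` is a short hypothetical stand-in for the full list built by the loop above.

```python
from collections import Counter

# Hypothetical stand-in for the `shows` list: one entry per cast member.
shows_toy = ["Girls", "Girls", "Derry Girls", "Girls", "Derry Girls"]

# Count how many times each show name appears, i.e. the cast size.
cast_sizes = Counter(shows_toy)

print(cast_sizes.most_common(1))  # [('Girls', 3)]
```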

However, it is often easier to work with hierarchical data by first flattening it to a `DataFrame`.

### Flattening Hierarchical Data

Although hierarchical data cannot be efficiently represented using a `DataFrame`, most questions do not require all of the information in the data. In these cases, it is helpful to first "flatten" the hierarchical data into a `DataFrame`.

For example, suppose we want to know the average runtime of shows. To answer this question, it suffices to work with a `DataFrame` with one row per show. We can use the `json_normalize()` function in `pandas` to flatten the data into a `DataFrame` of this form.

In [10]:
import pandas as pd

df_shows = pd.json_normalize(data_shows)
df_shows

Unnamed: 0,id,url,name,type,language,genres,status,runtime,premiered,officialSite,...,externals.thetvdb,externals.imdb,image.medium,image.original,network,webChannel.id,webChannel.name,webChannel.country.name,webChannel.country.code,webChannel.country.timezone
0,139,http://www.tvmaze.com/shows/139/girls,Girls,Scripted,English,"[Drama, Romance]",Ended,30,2012-04-15,http://www.hbo.com/girls,...,220411,tt1723816,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
1,722,http://www.tvmaze.com/shows/722/the-golden-girls,The Golden Girls,Scripted,English,"[Drama, Comedy]",Ended,30,1985-09-14,,...,71292,tt0088526,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
2,23542,http://www.tvmaze.com/shows/23542/good-girls,Good Girls,Scripted,English,"[Drama, Comedy, Crime]",Running,60,2018-02-26,https://www.nbc.com/good-girls?nbc=1,...,328577,tt6474378,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
3,6771,http://www.tvmaze.com/shows/6771/the-powerpuff...,The Powerpuff Girls,Animation,English,"[Comedy, Action, Science-Fiction]",Running,15,2016-04-04,https://www.cartoonnetwork.com/video/powerpuff...,...,307473,tt4718304,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
4,42726,http://www.tvmaze.com/shows/42726/florida-girls,Florida Girls,Scripted,English,[Comedy],Running,30,2019-07-10,https://poptv.com/floridagirls,...,363682,tt8548870,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
5,32087,http://www.tvmaze.com/shows/32087/chicken-girls,Chicken Girls,Scripted,English,"[Drama, Children, Music]",Running,16,2017-09-05,https://www.youtube.com/playlist?list=PLVewHiZ...,...,339854,,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,274.0,Brat,United States,US,America/New_York
6,33320,http://www.tvmaze.com/shows/33320/derry-girls,Derry Girls,Scripted,English,[Comedy],Running,30,2018-01-04,http://www.channel4.com/programmes/derry-girls,...,338903,tt7120662,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
7,1955,http://www.tvmaze.com/shows/1955/the-powerpuff...,The Powerpuff Girls,Animation,English,"[Action, Children, Crime]",Ended,30,1998-11-18,,...,76200,tt0175058,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
8,1073,http://www.tvmaze.com/shows/1073/bomb-girls,Bomb Girls,Scripted,English,"[Drama, Romance, War]",Ended,60,2012-01-04,,...,254378,tt1955311,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
9,525,http://www.tvmaze.com/shows/525/gilmore-girls,Gilmore Girls,Scripted,English,"[Drama, Comedy, Romance]",Ended,60,2000-10-05,,...,76568,tt0238784,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,


In [11]:
df_shows["runtime"].mean()

np.float64(36.1)

Let us take a closer look at the columns of this `DataFrame`.

In [12]:
df_shows.columns

Index(['id', 'url', 'name', 'type', 'language', 'genres', 'status', 'runtime',
       'premiered', 'officialSite', 'weight', 'webChannel', 'summary',
       'updated', 'cast', 'seasons', 'schedule.time', 'schedule.days',
       'rating.average', 'network.id', 'network.name', 'network.country.name',
       'network.country.code', 'network.country.timezone', 'externals.tvrage',
       'externals.thetvdb', 'externals.imdb', 'image.medium', 'image.original',
       'network', 'webChannel.id', 'webChannel.name',
       'webChannel.country.name', 'webChannel.country.code',
       'webChannel.country.timezone'],
      dtype='object')

Notice that:

- Fields that were themselves dictionaries, such as **`network`**, have been expanded into multiple columns, with names like **`network.name`**, **`network.country.name`**, etc.
- Repeated fields, such as **`cast`** and **`seasons`**, are just columns in this `DataFrame`.
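
The same two behaviors can be reproduced on a small example. The sketch below runs `pd.json_normalize()` on hypothetical toy records (the show, network, and person names are made up) and checks that nested dictionaries become dotted columns while a list stays in a single column.

```python
import pandas as pd

# Hypothetical toy records with one nested dict and one repeated field.
toy = [
    {"name": "Show A",
     "network": {"name": "Net 1", "country": {"code": "US"}},
     "cast": [{"person": {"name": "P1"}}, {"person": {"name": "P2"}}]},
    {"name": "Show B",
     "network": {"name": "Net 2", "country": {"code": "GB"}},
     "cast": [{"person": {"name": "P3"}}]},
]

df_toy = pd.json_normalize(toy)

# Nested dicts are expanded into dotted column names ...
print(sorted(df_toy.columns))
# ... while the repeated field is kept as a Python list in one cell.
print(type(df_toy.loc[0, "cast"]))
```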

Let's take a look at one of these repeated fields.

In [13]:
df_shows["cast"]

0    [{'person': {'id': 27410, 'url': 'http://www.t...
1    [{'person': {'id': 50062, 'url': 'http://www.t...
2    [{'person': {'id': 32974, 'url': 'http://www.t...
3    [{'person': {'id': 70135, 'url': 'http://www.t...
4    [{'person': {'id': 76588, 'url': 'http://www.t...
5    [{'person': {'id': 188808, 'url': 'http://www....
6    [{'person': {'id': 72647, 'url': 'http://www.t...
7    [{'person': {'id': 60712, 'url': 'http://www.t...
8    [{'person': {'id': 26252, 'url': 'http://www.t...
9    [{'person': {'id': 6800, 'url': 'http://www.tv...
Name: cast, dtype: object

These columns just contain a dump of the raw JSON. The information in these columns is not readily accessible.
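
Each cell in such a column is still an ordinary Python list, so simple questions can be answered without re-normalizing. The sketch below uses a hypothetical two-row `DataFrame` to show `.apply(len)` for the number of entries per row and `.explode()` for one row per list element; whether this is convenient enough depends on the question being asked.

```python
import pandas as pd

# Hypothetical frame whose "cast" column holds raw lists of dicts.
df_raw = pd.DataFrame({
    "name": ["Show A", "Show B"],
    "cast": [
        [{"person": {"name": "P1"}}, {"person": {"name": "P2"}}],
        [{"person": {"name": "P3"}}],
    ],
})

# Number of cast entries in each row.
cast_size = df_raw["cast"].apply(len)
print(cast_size.tolist())  # [2, 1]

# One row per cast entry; each cell in "cast" is now a single dict.
df_exploded = df_raw.explode("cast")
print(len(df_exploded))  # 3
```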

### A Simple Example

Now, suppose we want to get a list of the non-voice actors in the data set, like we did above, but using `pd.json_normalize()`.

We can specify the observational unit in `pd.json_normalize()`. So if we wanted a `DataFrame` where each row represents a cast member, we would pass in the name of that variable in the JSON data (i.e., **`cast`**) to `json_normalize()`.

In [14]:
df_actors = pd.json_normalize(data_shows, "cast")
df_actors

Unnamed: 0,self,voice,person.id,person.url,person.name,person.country.name,person.country.code,person.country.timezone,person.birthday,person.deathday,person.gender,person.image.medium,person.image.original,character.id,character.url,character.name,character.image.medium,character.image.original,person.country,character.image
0,False,False,27410,http://www.tvmaze.com/people/27410/lena-dunham,Lena Dunham,United States,US,America/New_York,1986-05-13,,Female,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36886,http://www.tvmaze.com/characters/36886/girls-h...,Hannah Horvath,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
1,False,False,11102,http://www.tvmaze.com/people/11102/allison-wil...,Allison Williams,United States,US,America/New_York,1988-04-13,,Female,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36887,http://www.tvmaze.com/characters/36887/girls-m...,Marnie Michaels,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
2,False,False,27411,http://www.tvmaze.com/people/27411/jemima-kirke,Jemima Kirke,United Kingdom,GB,Europe/London,1985-04-26,,Female,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36888,http://www.tvmaze.com/characters/36888/girls-j...,Jessa Johansson,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
3,False,False,16405,http://www.tvmaze.com/people/16405/zosia-mamet,Zosia Mamet,United States,US,America/New_York,1988-02-02,,Female,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36889,http://www.tvmaze.com/characters/36889/girls-s...,Shoshanna Shapiro,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
4,False,False,24858,http://www.tvmaze.com/people/24858/adam-driver,Adam Driver,United States,US,America/New_York,1983-11-19,,Male,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36890,http://www.tvmaze.com/characters/36890/girls-a...,Adam Sackler,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,False,False,20369,http://www.tvmaze.com/people/20369/edward-herr...,Edward Herrmann,United States,US,America/New_York,1943-07-21,2014-12-31,Male,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85545,http://www.tvmaze.com/characters/85545/gilmore...,Richard Gilmore,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
108,False,False,3354,http://www.tvmaze.com/people/3354/jared-padalecki,Jared Padalecki,United States,US,America/New_York,1982-07-19,,Male,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85552,http://www.tvmaze.com/characters/85552/gilmore...,Dean Forester,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
109,False,False,18121,http://www.tvmaze.com/people/18121/matt-czuchry,Matt Czuchry,United States,US,America/New_York,1977-05-20,,Male,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85550,http://www.tvmaze.com/characters/85550/gilmore...,Logan Huntzberger,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,
110,False,False,20026,http://www.tvmaze.com/people/20026/milo-ventim...,Milo Ventimiglia,United States,US,America/New_York,1977-07-08,,Male,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85553,http://www.tvmaze.com/characters/85553/gilmore...,Jess Mariano,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,


There is just one problem. We have lost all the fields in the data in the levels _above_ **`cast`**. If there are any fields from these levels that we want to keep, we must specify them explicitly in the `meta=` argument.

For example, we might want to know the **`name`** of the TV show. So we specify `meta="name"` (as well as a `meta_prefix` to avoid naming conflicts).

In [15]:
df_actors = pd.json_normalize(data_shows, "cast",
                              meta="name", meta_prefix="tvshow.")
df_actors

Unnamed: 0,self,voice,person.id,person.url,person.name,person.country.name,person.country.code,person.country.timezone,person.birthday,person.deathday,...,person.image.medium,person.image.original,character.id,character.url,character.name,character.image.medium,character.image.original,person.country,character.image,tvshow.name
0,False,False,27410,http://www.tvmaze.com/people/27410/lena-dunham,Lena Dunham,United States,US,America/New_York,1986-05-13,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36886,http://www.tvmaze.com/characters/36886/girls-h...,Hannah Horvath,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Girls
1,False,False,11102,http://www.tvmaze.com/people/11102/allison-wil...,Allison Williams,United States,US,America/New_York,1988-04-13,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36887,http://www.tvmaze.com/characters/36887/girls-m...,Marnie Michaels,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Girls
2,False,False,27411,http://www.tvmaze.com/people/27411/jemima-kirke,Jemima Kirke,United Kingdom,GB,Europe/London,1985-04-26,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36888,http://www.tvmaze.com/characters/36888/girls-j...,Jessa Johansson,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Girls
3,False,False,16405,http://www.tvmaze.com/people/16405/zosia-mamet,Zosia Mamet,United States,US,America/New_York,1988-02-02,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36889,http://www.tvmaze.com/characters/36889/girls-s...,Shoshanna Shapiro,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Girls
4,False,False,24858,http://www.tvmaze.com/people/24858/adam-driver,Adam Driver,United States,US,America/New_York,1983-11-19,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,36890,http://www.tvmaze.com/characters/36890/girls-a...,Adam Sackler,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Girls
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,False,False,20369,http://www.tvmaze.com/people/20369/edward-herr...,Edward Herrmann,United States,US,America/New_York,1943-07-21,2014-12-31,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85545,http://www.tvmaze.com/characters/85545/gilmore...,Richard Gilmore,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Gilmore Girls
108,False,False,3354,http://www.tvmaze.com/people/3354/jared-padalecki,Jared Padalecki,United States,US,America/New_York,1982-07-19,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85552,http://www.tvmaze.com/characters/85552/gilmore...,Dean Forester,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Gilmore Girls
109,False,False,18121,http://www.tvmaze.com/people/18121/matt-czuchry,Matt Czuchry,United States,US,America/New_York,1977-05-20,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85550,http://www.tvmaze.com/characters/85550/gilmore...,Logan Huntzberger,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Gilmore Girls
110,False,False,20026,http://www.tvmaze.com/people/20026/milo-ventim...,Milo Ventimiglia,United States,US,America/New_York,1977-07-08,,...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,85553,http://www.tvmaze.com/characters/85553/gilmore...,Jess Mariano,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,Gilmore Girls


Now it's easy to determine the show with the biggest cast.

In [16]:
df_actors[~df_actors["voice"]]["tvshow.name"].value_counts()

tvshow.name
Chicken Girls       31
Bomb Girls          15
Gilmore Girls       14
Derry Girls         12
Good Girls           8
Girls                8
Florida Girls        4
The Golden Girls     4
Name: count, dtype: int64
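
The same approach extends to a repeated field nested inside another repeated field, by passing a list as the record path. The sketch below makes each row an episode; the toy data is hypothetical and assumes each season carries an `episodes` list, as in the sketch at the start of this section.

```python
import pandas as pd

# Hypothetical toy data: shows -> seasons -> episodes.
toy = [
    {"name": "Show A",
     "seasons": [
         {"number": 1, "episodes": [{"name": "Pilot", "number": 1, "runtime": 30},
                                    {"name": "Second", "number": 2, "runtime": 28}]},
         {"number": 2, "episodes": [{"name": "Return", "number": 1, "runtime": 31}]},
     ]},
    {"name": "Show B",
     "seasons": [
         {"number": 1, "episodes": [{"name": "Start", "number": 1, "runtime": 60}]},
     ]},
]

# The record path walks down two levels; meta keeps the show name,
# and meta_prefix avoids a clash with the episode's own "name".
df_episodes = pd.json_normalize(
    toy,
    record_path=["seasons", "episodes"],
    meta=["name"],
    meta_prefix="tvshow.",
)

print(df_episodes)
print(df_episodes["tvshow.name"].value_counts())
```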

## eXtensible Markup Language (XML)

XML is another way to represent hierarchical data.

- Fields are represented by named *tags*.
- Each tag has an open `<tag>` and a close `</tag>`.
- Children are represented by nested tags.
- Repeated fields are represented by repeated tags.

In [None]:
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <show>
        <name>Girls</name>
        <network>
            <name>HBO</name>
            ...
        </network>
        <cast>
            <person>...</person>
            <character>....</character>
            <voice>...</voice>
        </cast>
        <cast>
            <person>...</person>
            <character>....</character>
            <voice>...</voice>
        </cast>
        <season>
            <episode>...</episode>
            <episode>...</episode>
            ...
        </season>
        <season>
            ...
        </season>
    </show>
</root>

Technical details:

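- XML documents typically begin with the declaration `<?xml ... ?>`. (It is optional in XML 1.0, but recommended.)
- XML documents must have exactly one root element that encloses everything else. Here it is named `<root>`, but any tag name is allowed.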
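
As a small illustration of how nested and repeated tags map onto a tree, the sketch below parses a hypothetical XML snippet with Python's standard-library `xml.etree.ElementTree` (used here only to keep the example dependency-free; it is not the library used in the rest of this section). The root element is called `<shows>` in this toy document.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; repeated <cast> tags form a repeated field.
xml_text = """<shows>
    <show>
        <name>Girls</name>
        <cast><person>Lena Dunham</person></cast>
        <cast><person>Allison Williams</person></cast>
    </show>
</shows>"""

root = ET.fromstring(xml_text)
print(root.tag)  # name of the single root element

for show_el in root.findall("show"):
    # findall() returns every direct child with the given tag.
    actors = [cast.findtext("person") for cast in show_el.findall("cast")]
    print(show_el.findtext("name"), actors)
```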

## XML

First, we read in the data using a library called BeautifulSoup.

In [None]:
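from bs4 import BeautifulSoup

# `requests` was imported earlier in this notebook.
response = requests.get("https://dlsun.github.io/pods/data/tvshows.xml")

# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(response.text, 'xml')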