# Nested Structures for Data Analytics in Pandas

<div style="width:200px"><img src="images/python_xml_json.png" alt="Python, XML, and JSON icon" width="200px" style="text-align:left;"/></div>
<br/>

## Popular Structured Data File Formats

<ul style="font-size:20px;line-height:24px">
    <li>Delimited text files: comma-separated, tab-delimited, pipe-delimited, etc.</li>
    <li>Nested files: XML, JSON, HTML, YAML, etc.</li>
    <li>Binary files: Excel spreadsheets, Stata .dta, Python pickles, etc.</li>
</ul>
<br/>


<hr style="border: 0.5px solid #E0E0E0;"/>

<h2>XML Data</h2>

<span style="font-size:20px;line-height:24px;">(E<b>x</b>tensible <b>M</b>arkup <b>L</b>anguage)</span>

<ul style="font-size:20px;line-height:24px">
    <li>Tree-like, nested markup structure (i.e., a root element with child nodes)</li>
    <li>Decades-long industry standards and specifications</li>
    <ul style="font-size:18px;line-height:22px"><li>E.g., KML, MathML, MusicXML, RDF, RSS, SVG</li></ul>
    <li>Used in APIs, archives, data dumps, metadata</li>
</ul>
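As a minimal sketch of the tree structure described above, the built-in `xml.etree.ElementTree` module can parse a small document and walk from the root to its child nodes. The XML string below is made up for illustration and only mimics the shape of the example data used later.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row document, invented for illustration
xml_str = """
<response>
  <row id="1"><stationame>Howard</stationame></row>
  <row id="2"><stationame>Jarvis</stationame></row>
</response>
""".strip()

root = ET.fromstring(xml_str)          # root element of the tree
print(root.tag)                        # tag name of the root

for node in root:                      # iterate over the root's child nodes
    # attributes live in .attrib, text content of a child element in .text
    print(node.tag, node.attrib["id"], node.find("stationame").text)
```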

<hr style="border: 0.5px solid #E0E0E0;"/>

## Example Data:

## City of Chicago: CTA Monthly L Rides
<span style="font-size: 18px"><a href="https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Monthly-Day-Type-A/t2rn-p8d7">https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Monthly-Day-Type-A/t2rn-p8d7</a></span>

In [1]:
import xml.etree.ElementTree as xet   # BUILT-IN MODULE
import lxml.etree as lxet             # THIRD-PARTY MODULE

import pandas as pd

In [3]:
doc = xet.parse("data/cta_monthly_l_rides_preview.xml")
doc2 = lxet.parse("data/cta_monthly_l_rides_preview.xml")

print(xet.tostring(doc.getroot()).decode("utf-8"))

<response>
  <row>
    <row _id="row-fwjy-9q5k-ps6p" _uuid="00000000-0000-0000-5FAB-5CAB5D21E10C" _position="0" _address="https://data.cityofchicago.org/resource/t2rn-p8d7/row-fwjy-9q5k-ps6p">
      <station_id>40900</station_id>
      <stationame>Howard</stationame>
      <month_beginning>2001-01-01T00:00:00</month_beginning>
      <avg_weekday_rides>6233.9</avg_weekday_rides>
      <avg_saturday_rides>3814.5</avg_saturday_rides>
      <avg_sunday_holiday_rides>2408.6</avg_sunday_holiday_rides>
      <monthtotal>164447</monthtotal>
    </row>
    <row _id="row-nyyd~hctf_nrwa" _uuid="00000000-0000-0000-6DE8-BA4BB25FC36E" _position="0" _address="https://data.cityofchicago.org/resource/t2rn-p8d7/row-nyyd~hctf_nrwa">
      <station_id>41190</station_id>
      <stationame>Jarvis</stationame>
      <month_beginning>2001-01-01T00:00:00</month_beginning>
      <avg_weekday_rides>1489.1</avg_weekday_rides>
      <avg_saturday_rides>1054</avg_saturday_rides>
      <avg_sunday_holiday_rides>718<

### <code>pandas.read_xml()</code>

In [4]:
# df = pd.read_xml("data/cta_monthly_l_rides_preview.xml", xpath=".//row/row")   # REQUIRES pandas >= 1.3

<br/>
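For readers on pandas 1.3 or later, `pandas.read_xml()` can load flat XML directly. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses a made-up in-memory document shaped like the preview file, the built-in `etree` parser so that `lxml` is not required, and an explicit `xpath` because the data rows sit two levels below the root.

```python
import io

import pandas as pd

# Hypothetical document mimicking the nesting of the preview file
xml_str = """<response>
  <row>
    <row _position="0">
      <station_id>40900</station_id>
      <stationame>Howard</stationame>
    </row>
    <row _position="0">
      <station_id>41190</station_id>
      <stationame>Jarvis</stationame>
    </row>
  </row>
</response>"""

# xpath selects the inner <row> elements; each one becomes a DataFrame row
df_xml = pd.read_xml(io.StringIO(xml_str), xpath=".//row/row", parser="etree")

print(df_xml)
```

Attributes and child elements both become columns, which is the same result the dictionary merge produces by hand.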

<h3>Merge Dictionaries in Nested List/Dict Comprehension</h3>

<h3><code>[{...}] + pandas.DataFrame()</code></h3>

In [5]:
# One dict per inner <row>: child tag/text pairs merged with the row's attributes
data_dict = [{**{child.tag: child.text for child in row.findall('*')}, **row.attrib}
                 for row in doc.findall('.//row/row')]

df = pd.DataFrame(data_dict)

df

,station_id,stationame,month_beginning,avg_weekday_rides,avg_saturday_rides,avg_sunday_holiday_rides,monthtotal,_id,_uuid,_position,_address
0,40900,Howard,2001-01-01T00:00:00,6233.9,3814.5,2408.6,164447,row-fwjy-9q5k-ps6p,00000000-0000-0000-5FAB-5CAB5D21E10C,0,https://data.cityofchicago.org/resource/t2rn-p...
1,41190,Jarvis,2001-01-01T00:00:00,1489.1,1054.0,718.0,40567,row-nyyd~hctf_nrwa,00000000-0000-0000-6DE8-BA4BB25FC36E,0,https://data.cityofchicago.org/resource/t2rn-p...
2,40100,Morse,2001-01-01T00:00:00,4412.5,3064.5,2087.8,119772,row-t6pr.n6ds.4am9,00000000-0000-0000-2191-71EC625EDF2F,0,https://data.cityofchicago.org/resource/t2rn-p...
3,41300,Loyola,2001-01-01T00:00:00,4664.5,3156.0,1952.8,125008,row-cjyk~sup7-huut,00000000-0000-0000-0AE6-ACE513A0342D,0,https://data.cityofchicago.org/resource/t2rn-p...
4,40760,Granville,2001-01-01T00:00:00,3109.8,2126.0,1453.8,84189,row-8ikx~cq3f.zwm2,00000000-0000-0000-5665-E0B11642D467,0,https://data.cityofchicago.org/resource/t2rn-p...
5,40850,Library,2020-09-01T00:00:00,864.2,534.0,417.2,22370,row-vvtf_7vrm-97xy,00000000-0000-0000-5736-C996BCE35489,0,https://data.cityofchicago.org/resource/t2rn-p...
6,40680,Adams/Wabash,2020-09-01T00:00:00,1332.3,791.8,601.6,34154,row-n9ei.vu5v.yd52,00000000-0000-0000-4734-5E409BEA7F70,0,https://data.cityofchicago.org/resource/t2rn-p...
7,41700,Washington/Wabash,2020-09-01T00:00:00,2707.4,1909.8,1438.6,71688,row-h278_29y8-vcs5,00000000-0000-0000-2D6E-542239E417F5,0,https://data.cityofchicago.org/resource/t2rn-p...
8,40260,State/Lake,2020-09-01T00:00:00,2410.0,2042.3,1660.6,67083,row-icuu~a5em_e9n4,00000000-0000-0000-368B-C636EBE912C2,0,https://data.cityofchicago.org/resource/t2rn-p...
9,40380,Clark/Lake,2020-09-01T00:00:00,2949.6,1657.0,1453.8,75839,row-na7a~6yct-k8fv,00000000-0000-0000-965D-B7B17499AB88,0,https://data.cityofchicago.org/resource/t2rn-p...
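The comprehension above depends on dictionary unpacking: `{**a, **b}` builds a new dict from the keys of `a` followed by those of `b`, and `b` wins on duplicate keys. A small sketch with made-up values:

```python
# Hypothetical child-element values and attributes for a single <row>
children = {"station_id": "40900", "stationame": "Howard"}
attribs = {"_position": "0", "station_id": "ATTR"}   # deliberately clashing key

merged = {**children, **attribs}

# Later mappings override earlier ones on duplicate keys
print(merged)
# {'station_id': 'ATTR', 'stationame': 'Howard', '_position': '0'}
```

In the CTA preview the attribute names (`_id`, `_uuid`, `_position`, `_address`) do not collide with any child tag, so nothing is overwritten.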


## XPath
<span style="font-size:20px">(Query language to search/retrieve nodes within XML documents)</span>

In [6]:
doc2 = lxet.parse("data/cta_monthly_l_rides.xml")

# XPath 1.0 predicate: match the station name and compare the rides value numerically
data_dict = [{row.tag:row.text for row in rows.xpath('*')}
                 for rows in doc2.xpath(".//row/row[stationame='Midway Airport' and "
                                        "avg_weekday_rides < 3000]")]

df2 = pd.DataFrame(data_dict)
             
df2

,station_id,stationame,month_beginning,avg_weekday_rides,avg_saturday_rides,avg_sunday_holiday_rides,monthtotal
0,40930,Midway Airport,2020-04-01T00:00:00,982.0,561.5,430.5,25571
1,40930,Midway Airport,2020-05-01T00:00:00,1030.7,648.0,452.7,26570
2,40930,Midway Airport,2020-06-01T00:00:00,1413.8,930.8,805.3,38047
3,40930,Midway Airport,2020-07-01T00:00:00,2022.7,1149.8,1132.8,54763
4,40930,Midway Airport,2020-08-01T00:00:00,2002.3,1245.6,1047.6,53515
5,40930,Midway Airport,2020-09-01T00:00:00,2162.7,1315.5,1086.8,56113
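The built-in `xml.etree.ElementTree` module supports only a subset of XPath: equality predicates such as `[stationame='...']` work, but numeric comparisons like `avg_weekday_rides < 3000` do not. One possible workaround, sketched here on made-up rows, is to keep the equality predicate in the path and apply the numeric test in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, invented for illustration
xml_str = """<response><row>
  <row><stationame>Midway Airport</stationame><avg_weekday_rides>982.0</avg_weekday_rides></row>
  <row><stationame>Midway Airport</stationame><avg_weekday_rides>8500.5</avg_weekday_rides></row>
  <row><stationame>Howard</stationame><avg_weekday_rides>1200.0</avg_weekday_rides></row>
</row></response>"""

root = ET.fromstring(xml_str)

# Equality predicate handled by ElementTree; numeric filter done in Python
matches = [
    {child.tag: child.text for child in row}
    for row in root.findall(".//row/row[stationame='Midway Airport']")
    if float(row.findtext("avg_weekday_rides")) < 3000
]

print(matches)
# [{'stationame': 'Midway Airport', 'avg_weekday_rides': '982.0'}]
```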


## Nested XML

In [2]:
doc = xet.parse("data/cta_monthly_l_rides_nested.xml")
doc2 = lxet.parse("data/cta_monthly_l_rides_nested.xml")

print(xet.tostring(doc.getroot()).decode("utf-8"))

<response>
  <row>
    <station id="40900" name="Howard" />
    <month>2001-01-01T00:00:00</month>
    <rides>
      <avg_weekday_rides>6233.9</avg_weekday_rides>
      <avg_saturday_rides>3814.5</avg_saturday_rides>
      <avg_sunday_holiday_rides>2408.6</avg_sunday_holiday_rides>
    </rides>
  </row>
  <row>
    <station id="41190" name="Jarvis" />
    <month>2001-01-01T00:00:00</month>
    <rides>
      <avg_weekday_rides>1489.1</avg_weekday_rides>
      <avg_saturday_rides>1054</avg_saturday_rides>
      <avg_sunday_holiday_rides>718</avg_sunday_holiday_rides>
    </rides>
  </row>
  <row>
    <station id="40100" name="Morse" />
    <month>2001-01-01T00:00:00</month>
    <rides>
      <avg_weekday_rides>4412.5</avg_weekday_rides>
      <avg_saturday_rides>3064.5</avg_saturday_rides>
      <avg_sunday_holiday_rides>2087.8</avg_sunday_holiday_rides>
    </rides>
  </row>
  <row>
    <station id="41300" name="Loyola" />
    <month>2001-01-01T00:00:00</month>
    <rides>
      <avg_we

## Merge Dictionaries in Nested List/Dict Comprehension

<h3><code>[{...}] + pandas.DataFrame()</code></h3>

In [6]:
# Per <row>: <station> attributes + <month> text + every child element of <rides>
data_dict = [{**{'station_id':rows.find('station').attrib['id'], 
                 'station_name':rows.find('station').attrib['name']}, 
              **{'month':rows.find('month').text}, 
              **{r.tag:r.text for r in rows.findall('rides/*')}}
                       for rows in doc.findall('.//row')]

df = pd.DataFrame(data_dict)
             
df

,station_id,station_name,month,avg_weekday_rides,avg_saturday_rides,avg_sunday_holiday_rides
0,40900,Howard,2001-01-01T00:00:00,6233.9,3814.5,2408.6
1,41190,Jarvis,2001-01-01T00:00:00,1489.1,1054.0,718.0
2,40100,Morse,2001-01-01T00:00:00,4412.5,3064.5,2087.8
3,41300,Loyola,2001-01-01T00:00:00,4664.5,3156.0,1952.8
4,40760,Granville,2001-01-01T00:00:00,3109.8,2126.0,1453.8
5,40850,Library,2020-09-01T00:00:00,864.2,534.0,417.2
6,40680,Adams/Wabash,2020-09-01T00:00:00,1332.3,791.8,601.6
7,41700,Washington/Wabash,2020-09-01T00:00:00,2707.4,1909.8,1438.6
8,40260,State/Lake,2020-09-01T00:00:00,2410.0,2042.3,1660.6
9,40380,Clark/Lake,2020-09-01T00:00:00,2949.6,1657.0,1453.8
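Note that `rows.find('month').text` raises an `AttributeError` when the element is missing, and `attrib['name']` raises a `KeyError` for a missing attribute. If the source may contain incomplete rows (an assumption; the preview file does not show any), `findtext()` and `.get()` return `None` instead. A minimal sketch on made-up data:

```python
import xml.etree.ElementTree as ET

# Hypothetical nested rows; the second lacks <month> and the name attribute
xml_str = """<response>
  <row>
    <station id="40900" name="Howard" />
    <month>2001-01-01T00:00:00</month>
    <rides><avg_weekday_rides>6233.9</avg_weekday_rides></rides>
  </row>
  <row>
    <station id="41190" />
    <rides><avg_weekday_rides>1489.1</avg_weekday_rides></rides>
  </row>
</response>"""

root = ET.fromstring(xml_str)

records = []
for row in root.findall("./row"):
    station = row.find("station")
    records.append({
        # .get() returns None for a missing attribute
        "station_id": station.get("id"),
        "station_name": station.get("name"),
        # findtext() returns None for a missing element
        "month": row.findtext("month"),
        **{r.tag: r.text for r in row.findall("rides/*")},
    })

print(records)
```

Passing `records` to `pandas.DataFrame()` would then leave the gaps as missing values rather than stopping on the first incomplete row.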


## XSLT
<span style="font-size:20px">(Declarative language for transforming XML documents)</span>

In [7]:
xsl_str = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
                <xsl:strip-space elements="*"/>

                <xsl:template match="/response">
                 <xsl:copy>
                   <xsl:apply-templates select="row"/>
                 </xsl:copy>
                </xsl:template>
            
                <xsl:template match="row">
                 <xsl:copy>
                   <station_id><xsl:value-of select="station/@id"/></station_id>
                   <station_name><xsl:value-of select="station/@name"/></station_name>
                   <xsl:copy-of select="month|rides/*"/>
                 </xsl:copy>
                </xsl:template>        

            </xsl:stylesheet>
          """

xsl = lxet.fromstring(xsl_str)
transformer = lxet.XSLT(xsl)

flat_doc = transformer(doc2)

print(flat_doc)

<response>
  <row>
    <station_id>40900</station_id>
    <station_name>Howard</station_name>
    <month>2001-01-01T00:00:00</month>
    <avg_weekday_rides>6233.9</avg_weekday_rides>
    <avg_saturday_rides>3814.5</avg_saturday_rides>
    <avg_sunday_holiday_rides>2408.6</avg_sunday_holiday_rides>
  </row>
  <row>
    <station_id>41190</station_id>
    <station_name>Jarvis</station_name>
    <month>2001-01-01T00:00:00</month>
    <avg_weekday_rides>1489.1</avg_weekday_rides>
    <avg_saturday_rides>1054</avg_saturday_rides>
    <avg_sunday_holiday_rides>718</avg_sunday_holiday_rides>
  </row>
  <row>
    <station_id>40100</station_id>
    <station_name>Morse</station_name>
    <month>2001-01-01T00:00:00</month>
    <avg_weekday_rides>4412.5</avg_weekday_rides>
    <avg_saturday_rides>3064.5</avg_saturday_rides>
    <avg_sunday_holiday_rides>2087.8</avg_sunday_holiday_rides>
  </row>
  <row>
    <station_id>41300</station_id>
    <station_name>Loyola</station_name>
    <month>2001-01-01
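Once the XSLT has flattened the document, each `<row>` holds only simple child elements, so it maps directly to one record. The following sketch repeats the idea on a small made-up document; the variable names and the trimmed stylesheet are illustrative, and it assumes `lxml` is installed.

```python
import lxml.etree as lxet

# Hypothetical nested input, invented for illustration
nested = lxet.fromstring(
    '<response><row><station id="40900" name="Howard"/>'
    '<month>2001-01-01T00:00:00</month>'
    '<rides><avg_weekday_rides>6233.9</avg_weekday_rides></rides></row></response>'
)

# Trimmed version of the flattening stylesheet
xsl = lxet.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/response">
    <xsl:copy><xsl:apply-templates select="row"/></xsl:copy>
  </xsl:template>
  <xsl:template match="row">
    <xsl:copy>
      <station_id><xsl:value-of select="station/@id"/></station_id>
      <station_name><xsl:value-of select="station/@name"/></station_name>
      <xsl:copy-of select="month|rides/*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""".strip())

flat = lxet.XSLT(xsl)(nested)

# One dict per flattened <row>, keyed by child tag
records = [{el.tag: el.text for el in row} for row in flat.getroot().findall("row")]
print(records)
```

Passing `records` to `pandas.DataFrame()` would give the same flat table as the comprehension approach.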

<hr style="border: 0.5px solid #E0E0E0;"/>


## JSON Data
<span style="font-size:20px;">(<b>J</b>ava<b>S</b>cript <b>O</b>bject <b>N</b>otation)</span>
 
<ul style="font-size:20px;line-height:24px">
    <li>Nested data structure of key-value mappings (objects) and ordered lists (arrays)</li>
    <li>Widely used web/application data format</li>
    <li>Lighter syntax than XML, with no markup rules (tags, attributes, namespaces)</li>
</ul>
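To make the key-value mapping concrete, the built-in `json` module converts JSON objects to Python dicts and JSON arrays to lists. The JSON text below is a made-up fragment shaped like the nested example data.

```python
import json

# Hypothetical JSON text, invented for illustration
json_str = '[{"station": {"station_id": "40900"}, "rides": {"avg_weekday_rides": "6233.9"}}]'

records = json.loads(json_str)

print(type(records))                          # JSON array  -> Python list
print(type(records[0]))                       # JSON object -> Python dict
print(records[0]["station"]["station_id"])    # nested values via chained keys
```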

In [10]:
import json       # BUILT-IN MODULE

In [11]:
with open('data/cta_monthly_l_rides_preview.json') as f:    
    data = f.read()
    
print(data)

[
  {
    "station_id": "40900",
    "stationame": "Howard",
    "month_beginning": "2001-01-01T00:00:00",
    "avg_weekday_rides": "6233.9",
    "avg_saturday_rides": "3814.5",
    "avg_sunday_holiday_rides": "2408.6",
    "monthtotal": "164447"
  },
  {
    "station_id": "41190",
    "stationame": "Jarvis",
    "month_beginning": "2001-01-01T00:00:00",
    "avg_weekday_rides": "1489.1",
    "avg_saturday_rides": "1054",
    "avg_sunday_holiday_rides": "718",
    "monthtotal": "40567"
  },
  {
    "station_id": "40100",
    "stationame": "Morse",
    "month_beginning": "2001-01-01T00:00:00",
    "avg_weekday_rides": "4412.5",
    "avg_saturday_rides": "3064.5",
    "avg_sunday_holiday_rides": "2087.8",
    "monthtotal": "119772"
  },
  {
    "station_id": "41300",
    "stationame": "Loyola",
    "month_beginning": "2001-01-01T00:00:00",
    "avg_weekday_rides": "4664.5",
    "avg_saturday_rides": "3156",
    "avg_sunday_holiday_rides": "1952.8",
    "monthtotal": "125008"
  },
  {
   

<h3><code>pandas.read_json()</code></h3>

In [12]:
df = pd.read_json('data/cta_monthly_l_rides_preview.json')

df

,station_id,stationame,month_beginning,avg_weekday_rides,avg_saturday_rides,avg_sunday_holiday_rides,monthtotal
0,40900,Howard,2001-01-01T00:00:00,6233.9,3814.5,2408.6,164447
1,41190,Jarvis,2001-01-01T00:00:00,1489.1,1054.0,718.0,40567
2,40100,Morse,2001-01-01T00:00:00,4412.5,3064.5,2087.8,119772
3,41300,Loyola,2001-01-01T00:00:00,4664.5,3156.0,1952.8,125008
4,40760,Granville,2001-01-01T00:00:00,3109.8,2126.0,1453.8,84189
5,40850,Library,2020-09-01T00:00:00,864.2,534.0,417.2,22370
6,40680,Adams/Wabash,2020-09-01T00:00:00,1332.3,791.8,601.6,34154
7,41700,Washington/Wabash,2020-09-01T00:00:00,2707.4,1909.8,1438.6,71688
8,40260,State/Lake,2020-09-01T00:00:00,2410.0,2042.3,1660.6,67083
9,40380,Clark/Lake,2020-09-01T00:00:00,2949.6,1657.0,1453.8,75839


## Nested JSON

In [13]:
with open('data/cta_monthly_l_rides_nested.json') as f:    
    data = f.read()
    
print(data)

[
  {
    "station": {
      "station_id": "40900",
      "stationame": "Howard"
    },
    "month": "2001-01-01T00:00:00",
    "rides": {
      "avg_weekday_rides": "6233.9",
      "avg_saturday_rides": "3814.5",
      "avg_sunday_holiday_rides": "2408.6"
    }
  },
  {
    "station": {
      "station_id": "41190",
      "stationame": "Jarvis"
    },
    "month": "2001-01-01T00:00:00",
    "rides": {
      "avg_weekday_rides": "1489.1",
      "avg_saturday_rides": "1054",
      "avg_sunday_holiday_rides": "718"
    }
  },
  {
    "station": {
      "station_id": "40100",
      "stationame": "Morse"
    },
    "month": "2001-01-01T00:00:00",
    "rides": {
      "avg_weekday_rides": "4412.5",
      "avg_saturday_rides": "3064.5",
      "avg_sunday_holiday_rides": "2087.8"
    }
  },
  {
    "station": {
      "station_id": "41300",
      "stationame": "Loyola"
    },
    "month": "2001-01-01T00:00:00",
    "rides": {
      "avg_weekday_rides": "4664.5",
      "avg_saturday_rides": "315

In [14]:
chi_month_rides_df = pd.read_json('data/cta_monthly_l_rides_nested.json')

chi_month_rides_df 

,station,month,rides
0,"{'station_id': '40900', 'stationame': 'Howard'}",2001-01-01T00:00:00,"{'avg_weekday_rides': '6233.9', 'avg_saturday_..."
1,"{'station_id': '41190', 'stationame': 'Jarvis'}",2001-01-01T00:00:00,"{'avg_weekday_rides': '1489.1', 'avg_saturday_..."
2,"{'station_id': '40100', 'stationame': 'Morse'}",2001-01-01T00:00:00,"{'avg_weekday_rides': '4412.5', 'avg_saturday_..."
3,"{'station_id': '41300', 'stationame': 'Loyola'}",2001-01-01T00:00:00,"{'avg_weekday_rides': '4664.5', 'avg_saturday_..."
4,"{'station_id': '40760', 'stationame': 'Granvil...",2001-01-01T00:00:00,"{'avg_weekday_rides': '3109.8', 'avg_saturday_..."
5,"{'station_id': '40850', 'stationame': 'Library'}",2020-09-01T00:00:00,"{'avg_weekday_rides': '864.2', 'avg_saturday_r..."
6,"{'station_id': '40680', 'stationame': 'Adams/W...",2020-09-01T00:00:00,"{'avg_weekday_rides': '1332.3', 'avg_saturday_..."
7,"{'station_id': '41700', 'stationame': 'Washing...",2020-09-01T00:00:00,"{'avg_weekday_rides': '2707.4', 'avg_saturday_..."
8,"{'station_id': '40260', 'stationame': 'State/L...",2020-09-01T00:00:00,"{'avg_weekday_rides': '2410', 'avg_saturday_ri..."
9,"{'station_id': '40380', 'stationame': 'Clark/L...",2020-09-01T00:00:00,"{'avg_weekday_rides': '2949.6', 'avg_saturday_..."


<h3><code>pandas.json_normalize()</code></h3>

In [15]:
with open('data/cta_monthly_l_rides_nested.json') as f:    
    jdata = json.load(f)
    
chi_month_rides_df = pd.json_normalize(jdata)

chi_month_rides_df 

,month,station.station_id,station.stationame,rides.avg_weekday_rides,rides.avg_saturday_rides,rides.avg_sunday_holiday_rides
0,2001-01-01T00:00:00,40900,Howard,6233.9,3814.5,2408.6
1,2001-01-01T00:00:00,41190,Jarvis,1489.1,1054.0,718.0
2,2001-01-01T00:00:00,40100,Morse,4412.5,3064.5,2087.8
3,2001-01-01T00:00:00,41300,Loyola,4664.5,3156.0,1952.8
4,2001-01-01T00:00:00,40760,Granville,3109.8,2126.0,1453.8
5,2020-09-01T00:00:00,40850,Library,864.2,534.0,417.2
6,2020-09-01T00:00:00,40680,Adams/Wabash,1332.3,791.8,601.6
7,2020-09-01T00:00:00,41700,Washington/Wabash,2707.4,1909.8,1438.6
8,2020-09-01T00:00:00,40260,State/Lake,2410.0,2042.3,1660.6
9,2020-09-01T00:00:00,40380,Clark/Lake,2949.6,1657.0,1453.8
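The dotted column names come from `json_normalize()` joining nested keys with its `sep` argument, which defaults to `'.'`. As a sketch on made-up records, passing `sep='_'` yields names that are easier to use as attributes or in query strings.

```python
import pandas as pd

# Hypothetical nested records, invented for illustration
records = [
    {"station": {"station_id": "40900", "stationame": "Howard"},
     "month": "2001-01-01T00:00:00",
     "rides": {"avg_weekday_rides": "6233.9"}},
]

# Join parent and child keys with an underscore instead of a dot
flat_df = pd.json_normalize(records, sep="_")

print(flat_df.columns.tolist())
```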


## Merge Dictionaries in List Comprehension

<h3><code>[{...}] + pandas.DataFrame()</code></h3>

In [16]:
with open('data/cta_monthly_l_rides_nested.json') as f:
    jdata = json.load(f)
    
print(type(jdata))
print(type(jdata[0]))

<class 'list'>
<class 'dict'>


In [17]:
data = [{**j['station'], **{'month': j['month']}, **j['rides']} for j in jdata]

chi_month_rides_df = pd.DataFrame(data)

chi_month_rides_df

,station_id,stationame,month,avg_weekday_rides,avg_saturday_rides,avg_sunday_holiday_rides
0,40900,Howard,2001-01-01T00:00:00,6233.9,3814.5,2408.6
1,41190,Jarvis,2001-01-01T00:00:00,1489.1,1054.0,718.0
2,40100,Morse,2001-01-01T00:00:00,4412.5,3064.5,2087.8
3,41300,Loyola,2001-01-01T00:00:00,4664.5,3156.0,1952.8
4,40760,Granville,2001-01-01T00:00:00,3109.8,2126.0,1453.8
5,40850,Library,2020-09-01T00:00:00,864.2,534.0,417.2
6,40680,Adams/Wabash,2020-09-01T00:00:00,1332.3,791.8,601.6
7,41700,Washington/Wabash,2020-09-01T00:00:00,2707.4,1909.8,1438.6
8,40260,State/Lake,2020-09-01T00:00:00,2410.0,2042.3,1660.6
9,40380,Clark/Lake,2020-09-01T00:00:00,2949.6,1657.0,1453.8
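Because every value in the source JSON is a quoted string, a frame built with `pandas.DataFrame()` from the merged dicts holds text columns. A possible follow-up step, sketched here on made-up records with a hypothetical `typed_df` name, converts them with `pandas.to_datetime()` and `pandas.to_numeric()`.

```python
import pandas as pd

# Hypothetical records with string values, as json.load() would return them
data = [
    {"station_id": "40900", "month": "2001-01-01T00:00:00", "avg_weekday_rides": "6233.9"},
    {"station_id": "41190", "month": "2001-01-01T00:00:00", "avg_weekday_rides": "1489.1"},
]

typed_df = pd.DataFrame(data)

# Convert text columns to datetime and float dtypes
typed_df["month"] = pd.to_datetime(typed_df["month"])
typed_df["avg_weekday_rides"] = pd.to_numeric(typed_df["avg_weekday_rides"])

print(typed_df.dtypes)
```

Leaving `station_id` as text is a deliberate choice in this sketch, since it is an identifier rather than a quantity.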


<br/>
<br/>
<br/>