
# DATA 304 — Module 4: Importing Data II
## Session 1 Demo Notebook — JSON and XML

This notebook accompanies the Session 1 lecture. It covers:
- Reading **flat JSON** into pandas
- Normalizing **nested JSON**
- Parsing **XML** with the Python standard library
- Handling irregular schemas, missing fields, and memory constraints
- Line-delimited JSON (NDJSON) and chunked processing


In [1]:

import json
import pandas as pd
from pathlib import Path
from pprint import pprint
import xml.etree.ElementTree as ET

DATA_DIR = Path('./data')
DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f'Data directory: {DATA_DIR.resolve()}')

Data directory: /workspaces/examples/Module04/data



## 1) Create sample datasets

We generate small, realistic JSON/XML examples to ensure reproducibility without external files:
- `flat.json`: simple list of records
- `nested.json`: parent with nested list plus optional fields
- `records_mixed.json`: deliberately inconsistent keys across records
- `events.ndjson`: line-delimited JSON
- `users_posts.xml`: XML with attributes and nested elements


In [2]:

# 1) Flat JSON (list of records)
flat = [
    {"id": 1, "name": "Alice", "age": 30, "city": "Knoxville"},
    {"id": 2, "name": "Bob", "age": 27, "city": "Nashville"},
    {"id": 3, "name": "Carmen", "age": 34, "city": "Memphis"}
]
(flat_path := DATA_DIR / 'flat.json').write_text(json.dumps(flat, indent=2))

# 2) Nested JSON (object with nested arrays and optional fields)
nested = {
    "run_id": "2025-09-08T09:45:00Z",
    "source": "api.example/v1/users",
    "users": [
        {
            "user_id": 1,
            "profile": {"name": "Alice", "email": "alice@example.com"},
            "posts": [
                {"post_id": 101, "likes": 5, "tags": ["intro", "welcome"]},
                {"post_id": 102, "likes": 7, "tags": ["data", "json"]}
            ]
        },
        {
            "user_id": 2,
            "profile": {"name": "Bob", "email": None},
            "posts": [
                {"post_id": 201, "likes": 1, "tags": []}
            ],
            # Optional field missing in other records
            "plan": {"tier": "pro", "renewal_date": "2025-12-31"}
        }
    ]
}
(nested_path := DATA_DIR / 'nested.json').write_text(json.dumps(nested, indent=2))

# 3) Records with mixed keys (irregular schema)
records_mixed = [
    {"id": 1, "name": "Alice", "dept": "Data", "start_date": "2024-01-15"},
    {"id": 2, "full_name": "Bob B.", "department": "Analytics"},
    {"identifier": 3, "name": "Carmen", "dept": "Data", "status": "contract"}
]
(records_mixed_path := DATA_DIR / 'records_mixed.json').write_text(json.dumps(records_mixed, indent=2))

# 4) NDJSON (line-delimited JSON) for streaming/chunked processing demo
ndjson_lines = [
    {"event_id": "e1", "ts": "2025-09-01T12:00:00Z", "type": "login"},
    {"event_id": "e2", "ts": "2025-09-01T12:05:00Z", "type": "view", "page": "/home"},
    {"event_id": "e3", "ts": "2025-09-01T12:06:00Z", "type": "click", "selector": "#cta"},
]
with (ndjson_path := DATA_DIR / 'events.ndjson').open('w') as f:
    for line in ndjson_lines:
        f.write(json.dumps(line) + "\n")

# 5) XML dataset with attributes and nested nodes
xml_content = '''<?xml version="1.0" encoding="UTF-8"?>
<users>
  <user id="1">
    <name>Alice</name>
    <contact>
      <email>alice@example.com</email>
    </contact>
    <posts>
      <post id="101" likes="5">
        <tags>
          <tag>intro</tag>
          <tag>welcome</tag>
        </tags>
      </post>
      <post id="102" likes="7">
        <tags>
          <tag>data</tag>
          <tag>json</tag>
        </tags>
      </post>
    </posts>
  </user>
  <user id="2">
    <name>Bob</name>
    <contact>
      <email/>
    </contact>
    <posts>
      <post id="201" likes="1">
        <tags/>
      </post>
    </posts>
    <plan tier="pro" renewal_date="2025-12-31"/>
  </user>
</users>
'''
(users_xml_path := DATA_DIR / 'users_posts.xml').write_text(xml_content)

flat_path, nested_path, records_mixed_path, ndjson_path, users_xml_path

(PosixPath('data/flat.json'),
 PosixPath('data/nested.json'),
 PosixPath('data/records_mixed.json'),
 PosixPath('data/events.ndjson'),
 PosixPath('data/users_posts.xml'))


## 2) Reading flat JSON

If the JSON is already a list of similarly shaped objects, `pandas.read_json` loads it directly, one row per object.


In [3]:
! head data/flat.json

[
  {
    "id": 1,
    "name": "Alice",
    "age": 30,
    "city": "Knoxville"
  },
  {
    "id": 2,
    "name": "Bob",


In [4]:

df_flat = pd.read_json(flat_path)
df_flat

Unnamed: 0,id,name,age,city
0,1,Alice,30,Knoxville
1,2,Bob,27,Nashville
2,3,Carmen,34,Memphis



## 3) Normalizing nested JSON

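`pandas.read_json` does not flatten nested objects: as the first cell below shows, the `users` column ends up holding raw dictionaries. Use `pandas.json_normalize` instead, with `record_path` naming the array whose items become rows and `meta` listing the parent fields to repeat on each row.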


In [5]:
# read_json alone does not flatten: the 'users' column still holds raw dicts
df_raw = pd.read_json(nested_path)
df_raw

Unnamed: 0,run_id,source,users
0,2025-09-08T09:45:00Z,api.example/v1/users,"{'user_id': 1, 'profile': {'name': 'Alice', 'e..."
1,2025-09-08T09:45:00Z,api.example/v1/users,"{'user_id': 2, 'profile': {'name': 'Bob', 'ema..."


In [6]:
data = json.loads(nested_path.read_text())
data

{'run_id': '2025-09-08T09:45:00Z',
 'source': 'api.example/v1/users',
 'users': [{'user_id': 1,
   'profile': {'name': 'Alice', 'email': 'alice@example.com'},
   'posts': [{'post_id': 101, 'likes': 5, 'tags': ['intro', 'welcome']},
    {'post_id': 102, 'likes': 7, 'tags': ['data', 'json']}]},
  {'user_id': 2,
   'profile': {'name': 'Bob', 'email': None},
   'posts': [{'post_id': 201, 'likes': 1, 'tags': []}],
   'plan': {'tier': 'pro', 'renewal_date': '2025-12-31'}}]}

In [7]:
# Flatten posts as the record path, bring along user-level and run-level metadata
df_posts = pd.json_normalize(
    data,
    record_path=['users', 'posts'],
    meta=[
        ['users', 'user_id'],
        ['users', 'profile', 'name'],
        ['users', 'profile', 'email'],
        ['users', 'plan', 'tier'],
        ['users', 'plan', 'renewal_date'],
        'run_id', 'source'
    ],
    errors='ignore'  # meta keys missing from a record (e.g. 'plan') become NaN instead of raising KeyError
)
df_posts

Unnamed: 0,post_id,likes,tags,users.user_id,users.profile.name,users.profile.email,users.plan.tier,users.plan.renewal_date,run_id,source
0,101,5,"[intro, welcome]",1,Alice,alice@example.com,,,2025-09-08T09:45:00Z,api.example/v1/users
1,102,7,"[data, json]",1,Alice,alice@example.com,,,2025-09-08T09:45:00Z,api.example/v1/users
2,201,1,[],2,Bob,,pro,2025-12-31,2025-09-08T09:45:00Z,api.example/v1/users



### Alternative strategy: explode

Normalize the parent list to rows, expand the nested lists with `explode`, then flatten each post dict into columns and join the result back.


In [8]:
df_users = pd.json_normalize(data['users'])
df_users

Unnamed: 0,user_id,posts,profile.name,profile.email,plan.tier,plan.renewal_date
0,1,"[{'post_id': 101, 'likes': 5, 'tags': ['intro'...",Alice,alice@example.com,,
1,2,"[{'post_id': 201, 'likes': 1, 'tags': []}]",Bob,,pro,2025-12-31


In [9]:
# Explode posts list
df_users_exploded = df_users.explode('posts', ignore_index=True)
df_users_exploded

Unnamed: 0,user_id,posts,profile.name,profile.email,plan.tier,plan.renewal_date
0,1,"{'post_id': 101, 'likes': 5, 'tags': ['intro',...",Alice,alice@example.com,,
1,1,"{'post_id': 102, 'likes': 7, 'tags': ['data', ...",Alice,alice@example.com,,
2,2,"{'post_id': 201, 'likes': 1, 'tags': []}",Bob,,pro,2025-12-31


In [10]:
# Expand dicts inside 'posts' into columns 
posts_cols = pd.json_normalize(df_users_exploded['posts']).add_prefix('post.')
posts_cols

Unnamed: 0,post.post_id,post.likes,post.tags
0,101,5,"[intro, welcome]"
1,102,7,"[data, json]"
2,201,1,[]


In [11]:
# Join back on the row index, which lines up because explode used ignore_index=True
df_users_posts = df_users_exploded.drop(columns=['posts']).join(posts_cols)
df_users_posts

Unnamed: 0,user_id,profile.name,profile.email,plan.tier,plan.renewal_date,post.post_id,post.likes,post.tags
0,1,Alice,alice@example.com,,,101,5,"[intro, welcome]"
1,1,Alice,alice@example.com,,,102,7,"[data, json]"
2,2,Bob,,pro,2025-12-31,201,1,[]


In [12]:
# Bring run-level metadata
df_users_posts['run_id'] = data['run_id']
df_users_posts['source'] = data['source']

df_users_posts

Unnamed: 0,user_id,profile.name,profile.email,plan.tier,plan.renewal_date,post.post_id,post.likes,post.tags,run_id,source
0,1,Alice,alice@example.com,,,101,5,"[intro, welcome]",2025-09-08T09:45:00Z,api.example/v1/users
1,1,Alice,alice@example.com,,,102,7,"[data, json]",2025-09-08T09:45:00Z,api.example/v1/users
2,2,Bob,,pro,2025-12-31,201,1,[],2025-09-08T09:45:00Z,api.example/v1/users



## 4) Handling irregular schemas and missing fields

Real JSON often has inconsistent keys. You can:
- Normalize with pandas and let missing keys become `NaN`
- Apply schema alignment or renaming to harmonize variants
- Use `.get()` when building rows manually
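
The third option, building rows manually with `.get()`, might look like the sketch below. The `first_present` helper and the alias tuples are illustrative assumptions, not part of pandas or of this notebook's code; the records mirror `records_mixed.json`.

```python
# Sketch: build uniform rows by hand, trying each known key variant in turn.
# `first_present` and the alias tuples are hypothetical, for illustration only.
records = [
    {"id": 1, "name": "Alice", "dept": "Data", "start_date": "2024-01-15"},
    {"id": 2, "full_name": "Bob B.", "department": "Analytics"},
    {"identifier": 3, "name": "Carmen", "dept": "Data", "status": "contract"},
]


def first_present(record, keys, default=None):
    """Return the first non-None value among the given keys, else the default."""
    for key in keys:
        value = record.get(key)
        if value is not None:
            return value
    return default


rows = [
    {
        "id": first_present(rec, ("id", "identifier")),
        "name": first_present(rec, ("name", "full_name")),
        "dept": first_present(rec, ("dept", "department")),
        "start_date": rec.get("start_date"),  # None when the key is absent
        "status": rec.get("status"),
    }
    for rec in records
]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give one column per field, with no duplicate labels to merge afterwards.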


In [13]:
mixed = json.loads(records_mixed_path.read_text())
mixed

[{'id': 1, 'name': 'Alice', 'dept': 'Data', 'start_date': '2024-01-15'},
 {'id': 2, 'full_name': 'Bob B.', 'department': 'Analytics'},
 {'identifier': 3, 'name': 'Carmen', 'dept': 'Data', 'status': 'contract'}]

In [14]:
# Normalize as-is: every key variant becomes its own column, and ids become floats because of NaN
df_mixed = pd.json_normalize(mixed)
df_mixed

Unnamed: 0,id,name,dept,start_date,full_name,department,identifier,status
0,1.0,Alice,Data,2024-01-15,,,,
1,2.0,,,,Bob B.,Analytics,,
2,,Carmen,Data,,,,3.0,contract


In [15]:
# Harmonize variants into a consistent schema (renaming leaves duplicate column labels to merge afterwards)
rename_map = {
    'full_name': 'name',
    'department': 'dept',
    'identifier': 'id'
}
df_harmonized = df_mixed.rename(columns=rename_map)
df_harmonized

Unnamed: 0,id,name,dept,start_date,name,dept,id,status
0,1.0,Alice,Data,2024-01-15,,,,
1,2.0,,,,Bob B.,Analytics,,
2,,Carmen,Data,,,,3.0,contract


In [16]:
# Duplicate labels: selecting 'name' returns a DataFrame with both columns
df_harmonized['name']

Unnamed: 0,name,name
0,Alice,
1,,Bob B.
2,Carmen,


In [17]:
merged = pd.DataFrame(index=df_harmonized.index)

for col in df_harmonized.columns.unique():
    cols = df_harmonized.loc[:, df_harmonized.columns == col]
    # Collapse duplicates left to right, keeping the first non-null value per row
    series = cols.iloc[:, 0]
    for j in range(1, cols.shape[1]):
        series = series.combine_first(cols.iloc[:, j])
    merged[col] = series

df_harmonized = merged

df_harmonized

Unnamed: 0,id,name,dept,start_date,status
0,1.0,Alice,Data,2024-01-15,
1,2.0,Bob B.,Analytics,,
2,3.0,Carmen,Data,,contract



## 5) Type conversion and dates

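Values often arrive as strings (or as floats once `NaN` is involved), so convert them to numeric or datetime dtypes after normalization. `errors='coerce'` turns unparseable entries into `NaN`/`NaT` instead of raising, and `downcast='integer'` picks the smallest integer dtype that fits.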


In [18]:
# Convert ids and likes to compact integer dtypes, and start_date to datetime
df_posts['post_id'] = pd.to_numeric(df_posts['post_id'], downcast='integer', errors='coerce')
df_posts['likes'] = pd.to_numeric(df_posts['likes'], downcast='integer', errors='coerce')
df_harmonized['id'] = pd.to_numeric(df_harmonized['id'], downcast='integer', errors='coerce')
df_harmonized['start_date'] = pd.to_datetime(df_harmonized['start_date'], errors='coerce')
df_posts.dtypes, df_harmonized.dtypes

(post_id                     int16
 likes                        int8
 tags                       object
 users.user_id              object
 users.profile.name         object
 users.profile.email        object
 users.plan.tier            object
 users.plan.renewal_date    object
 run_id                     object
 source                     object
 dtype: object,
 id                      int8
 name                  object
 dept                  object
 start_date    datetime64[ns]
 status                object
 dtype: object)
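
One caveat worth sketching: `downcast='integer'` can only produce an integer dtype when no values are missing, because `NaN` forces a float column. The example below uses made-up values and pandas' nullable `Int64` dtype as one possible way to keep integers alongside missing entries.

```python
import pandas as pd

# Made-up raw values: numeric strings, one bad entry, one missing entry
raw = pd.Series(["10", "20", "n/a", None])

# errors='coerce' turns unparseable entries into NaN, so the result is float64
as_float = pd.to_numeric(raw, errors="coerce")

# The nullable Int64 dtype stores integers and marks the gaps as <NA>
as_int = as_float.astype("Int64")

# Unparseable dates become NaT instead of raising
dates = pd.to_datetime(pd.Series(["2024-01-15", "not a date"]), errors="coerce")

print(as_float.dtype, as_int.dtype)
print(dates.isna().tolist())
```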


## 6) Line-delimited JSON (NDJSON) and chunked processing

Use `lines=True` for NDJSON. For large files, add `chunksize` so that `read_json` returns an iterator of DataFrames instead of loading everything at once.


In [19]:
# Read all at once
df_events = pd.read_json(ndjson_path, lines=True)
df_events

Unnamed: 0,event_id,ts,type,page,selector
0,e1,2025-09-01T12:00:00Z,login,,
1,e2,2025-09-01T12:05:00Z,view,/home,
2,e3,2025-09-01T12:06:00Z,click,,#cta


In [20]:
# Simulate chunked processing; each chunk only gets columns for the keys it contains
print("\nProcess in chunks of size 1:")
for chunk in pd.read_json(ndjson_path, lines=True, chunksize=1):
    # Example transformation: parse timestamps as timezone-aware UTC
    chunk['ts'] = pd.to_datetime(chunk['ts'], utc=True, errors='coerce')
    display(chunk)


Process in chunks of size 1:


Unnamed: 0,event_id,ts,type
0,e1,2025-09-01 12:00:00+00:00,login


Unnamed: 0,event_id,ts,type,page
1,e2,2025-09-01 12:05:00+00:00,view,/home


Unnamed: 0,event_id,ts,type,selector
2,e3,2025-09-01 12:06:00+00:00,click,#cta
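
With real data, each chunk is usually reduced to a small summary rather than displayed. A minimal sketch, assuming an in-memory NDJSON string in place of a large file and counting event types as an arbitrary example of an aggregate:

```python
import io
from collections import Counter

import pandas as pd

# In-memory stand-in for a large NDJSON file (same shape as events.ndjson)
ndjson_text = "\n".join([
    '{"event_id": "e1", "type": "login"}',
    '{"event_id": "e2", "type": "view", "page": "/home"}',
    '{"event_id": "e3", "type": "click", "selector": "#cta"}',
    '{"event_id": "e4", "type": "view", "page": "/docs"}',
]) + "\n"

type_counts = Counter()
# chunksize requires lines=True; each iteration yields a DataFrame of up to 2 rows
for chunk in pd.read_json(io.StringIO(ndjson_text), lines=True, chunksize=2):
    # Fold the per-chunk counts into the running total
    type_counts.update(chunk["type"].value_counts().to_dict())

print(dict(type_counts))
```

Only the running `Counter` persists between iterations, so memory use depends on the chunk size rather than on the file size.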



## 7) XML parsing with ElementTree

Strategy:
1. Parse the tree
2. Iterate through `user` nodes
3. Extract attributes and child text
4. Expand nested `post` elements to rows


In [21]:
! head -20 ./data/users_posts.xml

<?xml version="1.0" encoding="UTF-8"?>
<users>
  <user id="1">
    <name>Alice</name>
    <contact>
      <email>alice@example.com</email>
    </contact>
    <posts>
      <post id="101" likes="5">
        <tags>
          <tag>intro</tag>
          <tag>welcome</tag>
        </tags>
      </post>
      <post id="102" likes="7">
        <tags>
          <tag>data</tag>
          <tag>json</tag>
        </tags>
      </post>


In [22]:
tree = ET.parse(users_xml_path)

root = tree.getroot()

rows = []
for user in root.findall('user'):
    user_id = user.get('id')
    name_el = user.find('name')
    email_el = user.find('contact/email')
    plan_el = user.find('plan')

    user_name = name_el.text if name_el is not None else None
    user_email = email_el.text if email_el is not None else None
    plan_tier = plan_el.get('tier') if plan_el is not None else None
    plan_renewal = plan_el.get('renewal_date') if plan_el is not None else None

    posts = user.findall('posts/post')
    if posts:
        for post in posts:
            post_id = post.get('id')
            likes = post.get('likes')
            # Collect tag texts; an empty <tags/> yields an empty list
            tags = [t.text for t in post.findall('tags/tag')]
            rows.append({
                'user.id': user_id,
                'user.name': user_name,
                'user.email': user_email,
                'plan.tier': plan_tier,
                'plan.renewal_date': plan_renewal,
                'post.id': post_id,
                'post.likes': int(likes) if likes is not None else None,
                'post.tags': tags
            })
    else:
        # If no posts, still record the user row
        rows.append({
            'user.id': user_id,
            'user.name': user_name,
            'user.email': user_email,
            'plan.tier': plan_tier,
            'plan.renewal_date': plan_renewal,
            'post.id': None,
            'post.likes': None,
            'post.tags': []
        })

df_xml = pd.DataFrame(rows)
df_xml

Unnamed: 0,user.id,user.name,user.email,plan.tier,plan.renewal_date,post.id,post.likes,post.tags
0,1,Alice,alice@example.com,,,101,5,"[intro, welcome]"
1,1,Alice,alice@example.com,,,102,7,"[data, json]"
2,2,Bob,,pro,2025-12-31,201,1,[]
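
`ET.parse` loads the whole document into memory. For files too large for that, `ET.iterparse` from the standard library offers an incremental route. The sketch below uses a trimmed, in-memory copy of the sample XML and collects only a per-user post count; both choices are made purely for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for users_posts.xml, kept in memory for the example
xml_bytes = b"""<users>
  <user id="1">
    <name>Alice</name>
    <posts><post id="101" likes="5"/><post id="102" likes="7"/></posts>
  </user>
  <user id="2">
    <name>Bob</name>
    <posts><post id="201" likes="1"/></posts>
  </user>
</users>"""

user_rows = []
# An "end" event fires once an element and all of its children have been read
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "user":
        user_rows.append({
            "user.id": elem.get("id"),
            "user.name": elem.findtext("name"),
            "n_posts": len(elem.findall("posts/post")),
        })
        # Drop the finished user's children so they can be garbage collected
        elem.clear()

print(user_rows)
```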



## 8) Expanding tags and exploding lists

Use `explode` to turn each list element into its own row. A row with an empty list is kept, with `NaN` in place of the list.


In [23]:

df_xml_expanded = df_xml.explode('post.tags', ignore_index=True)
df_xml_expanded

Unnamed: 0,user.id,user.name,user.email,plan.tier,plan.renewal_date,post.id,post.likes,post.tags
0,1,Alice,alice@example.com,,,101,5,intro
1,1,Alice,alice@example.com,,,101,5,welcome
2,1,Alice,alice@example.com,,,102,7,data
3,1,Alice,alice@example.com,,,102,7,json
4,2,Bob,,pro,2025-12-31,201,1,



## 9) Namespaces and robustness

Real XML often includes namespaces, like `<ns:tag xmlns:ns="http://example.com/ns">`.
Search with fully qualified names (`{http://example.com/ns}tag`) or pass a prefix-to-URI map through the `namespaces` argument of `find`/`findall`.


In [24]:

# Example pattern for namespaces (demonstration only, not applied to our sample):
# ns = {'ns': 'http://example.com/ns'}
# for el in root.findall('ns:record/ns:item', namespaces=ns):
#     ...

print("When you see prefixes like ns:, define a namespace map and use it in find/findall.")

When you see prefixes like ns:, define a namespace map and use it in find/findall.
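
A runnable version of that pattern, using a small made-up document. The namespace URI `http://example.com/ns` and the `record`/`item` element names are placeholders carried over from the commented pattern above, not a real schema.

```python
import xml.etree.ElementTree as ET

# Made-up namespaced document; the URI and element names are placeholders
xml_ns = """<ns:data xmlns:ns="http://example.com/ns">
  <ns:record id="r1">
    <ns:item>alpha</ns:item>
    <ns:item>beta</ns:item>
  </ns:record>
</ns:data>"""

ns_root = ET.fromstring(xml_ns)

# Option 1: map a prefix of your choosing to the namespace URI
ns = {"ns": "http://example.com/ns"}
items_prefixed = [el.text for el in ns_root.findall("ns:record/ns:item", namespaces=ns)]

# Option 2: fully qualified names in {uri}tag form
uri = "{http://example.com/ns}"
items_qualified = [el.text for el in ns_root.findall(f"{uri}record/{uri}item")]

# Without either, the search matches nothing and raises no error
items_plain = ns_root.findall("record/item")

print(items_prefixed, items_qualified, items_plain)
```

The prefix in the map only needs to match the one used in the search path; it does not have to match the prefix written in the file.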



## 10) Exporting cleaned tables

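Export normalized tables as CSV or Parquet for downstream analysis. The cell below writes CSV; `DataFrame.to_parquet` is the Parquet counterpart and requires an optional engine such as `pyarrow`.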


In [25]:

out_dir = DATA_DIR / 'outputs'
out_dir.mkdir(exist_ok=True)

df_flat.to_csv(out_dir / 'flat.csv', index=False)
df_posts.to_csv(out_dir / 'posts_from_json.csv', index=False)
df_users_posts.to_csv(out_dir / 'users_posts_from_json.csv', index=False)
df_harmonized.to_csv(out_dir / 'records_harmonized.csv', index=False)
df_events.to_csv(out_dir / 'events.csv', index=False)
df_xml.to_csv(out_dir / 'xml_users_posts.csv', index=False)
df_xml_expanded.to_csv(out_dir / 'xml_users_posts_exploded.csv', index=False)

list(out_dir.iterdir())

[PosixPath('data/outputs/posts_from_json.csv'),
 PosixPath('data/outputs/xml_users_posts.csv'),
 PosixPath('data/outputs/users_posts_from_json.csv'),
 PosixPath('data/outputs/records_harmonized.csv'),
 PosixPath('data/outputs/xml_users_posts_exploded.csv'),
 PosixPath('data/outputs/flat.csv'),
 PosixPath('data/outputs/events.csv')]
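
One thing to watch when exporting: list-valued columns such as `tags` are written to CSV as their Python string representation. If the lists need to survive a round trip, one option (an assumption here, not something the notebook does) is to serialize them as JSON strings first. The sketch uses an in-memory buffer instead of a file.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"post_id": [101, 201], "tags": [["intro", "welcome"], []]})

# Serialize each list to a JSON string before writing
out = df.assign(tags=df["tags"].apply(json.dumps))
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Read back and decode the JSON strings into lists again
buffer.seek(0)
restored = pd.read_csv(buffer)
restored["tags"] = restored["tags"].apply(json.loads)

print(restored)
```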


## Summary
- Flat JSON → `read_json`
- Nested JSON → `json_normalize` or explode strategy
- Irregular schema → harmonize columns post-normalization
- NDJSON → `lines=True`, use `chunksize` for large files
- XML → parse tree, iterate elements, expand nested items
- Export clean tables for analysis
