In [42]:
import pandas as pd
import json

# Parsing nested JSON

When working with JSON data in Python, it's common to encounter nested structures. 
But a pandas DataFrame is inherently two-dimensional, so nested JSON objects must be flattened into a flat table before they can be loaded.

We can approach this in two ways:
- Python's list comprehensions or the `map()` function
- the [pd.json_normalize()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) function, which normalizes semi-structured JSON data into a flat table.

## Example

Consider the following JSON data:

In [43]:
json_data = '''
{
  "employees": [
    {
      "name": "Ivan Ivanov",
      "email": "ivan@example.com",
      "job": {
        "title": "Analyst",
        "department": "Finance"
      }
    },
    {
      "name": "Maria Popova",
      "email": "maria@example.com",
      "job": {
        "title": "Engineer",
        "department": "Development"
      }
    }
  ]
}
'''

### Manual parsing

In [44]:
raw_data = json.loads(json_data)

data = [
    [r['name'], r['email'], r['job']['title'], r['job']['department']]
    for r in raw_data['employees']
]
columns = ['name', 'email', 'job_title', 'department']
df = pd.DataFrame(data, columns=columns)
df

|   | name | email | job_title | department |
|---|---|---|---|---|
| 0 | Ivan Ivanov | ivan@example.com | Analyst | Finance |
| 1 | Maria Popova | maria@example.com | Engineer | Development |
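
The same flattening can also be written with `map()`, the other manual option listed at the start. This is only a minimal sketch: the helper name `flatten_employee` is an assumption made for illustration, and the data is redefined (in shortened form) so that the snippet runs on its own.

```python
import json

import pandas as pd

# Shortened copy of the employees JSON used in this example.
employees_json = '''
{
  "employees": [
    {"name": "Ivan Ivanov", "email": "ivan@example.com",
     "job": {"title": "Analyst", "department": "Finance"}},
    {"name": "Maria Popova", "email": "maria@example.com",
     "job": {"title": "Engineer", "department": "Development"}}
  ]
}
'''


def flatten_employee(record):
    """Hypothetical helper: turn one nested employee record into a flat dict."""
    return {
        "name": record["name"],
        "email": record["email"],
        "job_title": record["job"]["title"],
        "department": record["job"]["department"],
    }


# map() applies the helper to every employee; the dict keys become column names.
flat_rows = list(map(flatten_employee, json.loads(employees_json)["employees"]))
df_mapped = pd.DataFrame(flat_rows)
print(df_mapped)
```

Returning dicts instead of lists keeps each value next to its column name, so no separate `columns` list is needed.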


### With pd.json_normalize()

In [45]:
data = json.loads(json_data)
df = pd.json_normalize(
    data['employees'],
    sep='_'
)
df

|   | name | email | job_title | job_department |
|---|---|---|---|---|
| 0 | Ivan Ivanov | ivan@example.com | Analyst | Finance |
| 1 | Maria Popova | maria@example.com | Engineer | Development |


## Using pd.json_normalize()

### Overview
`pd.json_normalize()` flattens nested JSON objects (already parsed into Python dicts and lists) into a flat table.

The key parameters are:

- **data**: The input data to normalize: a dict or a list of dicts (unserialized JSON objects).
- **record_path**: The path (a key or a list of keys) to the list of records in each object that should be expanded into rows. If omitted, `data` itself is treated as the list of records.
- **meta**: Fields of the enclosing object to include as metadata columns alongside each record. A nested field is given as a list of keys.
- **meta_prefix**: A prefix for the meta fields, helping to avoid column name collisions and clarify the DataFrame's structure.
- **record_prefix**: A prefix for the fields from the record_path, distinguishing them from other data in the DataFrame.
- **errors**: Controls how keys listed in `meta` that are missing from a record are handled. The options are 'raise' (the default), which raises a `KeyError`, and 'ignore', which fills the missing value with NaN.
- **sep**: Specifies the separator used when joining nested keys. The default is '.'.
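
As a rough illustration of the `errors` parameter, the sketch below uses a made-up dataset in which one record lacks a key listed in `meta`. The `team`, `lead`, and `members` keys and the variable names are assumptions for this example only.

```python
import pandas as pd

# Hypothetical data: the second record has no "lead" key.
teams = [
    {"team": "A", "lead": "Ivan", "members": [{"name": "Dimitar"}]},
    {"team": "B", "members": [{"name": "Stefan"}]},
]

# With the default errors='raise', the missing meta key causes a KeyError.
try:
    pd.json_normalize(teams, record_path="members", meta=["team", "lead"])
    raised = False
except KeyError:
    raised = True

# With errors='ignore', the missing value is filled with NaN instead.
df_teams = pd.json_normalize(
    teams,
    record_path="members",
    meta=["team", "lead"],
    errors="ignore",
)
print(raised)
print(df_teams)
```

Note that `errors` only concerns `meta` keys; a record that lacks the `record_path` itself still raises an error.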

### Example 1: Basic Usage

The basic use case of pd.json_normalize involves flattening a simple nested JSON object into a pandas DataFrame. 

Consider the following example with a list of people, where each person has basic information nested under an info key.

In [54]:
data = [
    {"id": 1, "name": "Ivan Petrov", "info": {"age": 28, "city": "Sofia"}},
    {"id": 2, "name": "Maria Georgieva", "info": {"age": 34, "city": "Plovdiv"}},
]

# Flatten the JSON data
df = pd.json_normalize(data)
df


|   | id | name | info.age | info.city |
|---|---|---|---|---|
| 0 | 1 | Ivan Petrov | 28 | Sofia |
| 1 | 2 | Maria Georgieva | 34 | Plovdiv |
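
The input does not have to be a list. As a small sketch (the variable names `person` and `df_one` are chosen here for illustration), a single dict is treated as one record, and `sep` can replace the default `.` in the generated column names:

```python
import pandas as pd

# A single record passed as a plain dict instead of a list of dicts.
person = {"id": 1, "name": "Ivan Petrov", "info": {"age": 28, "city": "Sofia"}}

# sep='_' joins nested keys with an underscore instead of the default '.'.
df_one = pd.json_normalize(person, sep="_")
print(df_one.columns.tolist())
print(df_one)
```

Underscore-separated names are often more convenient, since columns such as `info_age` can then be used with attribute access and in `DataFrame.query()` without quoting.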


### Example 2: Handling Nested Structures (using record_path)

When dealing with more complex nested structures, you can use the `record_path` parameter to specify which nested list should be expanded into rows.

Consider a dataset where each person also has a list of friends, with each friend having their own set of information:

In [134]:
data = [
    {
        "id": 1,
        "name": "Ivan Petrov",
        "info": {
            "age": 28,
            "city": "Sofia",
            "friends": [
                {"name": "Dimitar Dimitrov", "age": 30},
                {"name": "Stefan Ivanov", "age": 26}
            ]
        }
    }
]

# Flatten, focusing on the "friends" data:
df = pd.json_normalize(
    data=data,
    record_path=["info", "friends"],  # Path to the nested list to be expanded.
    errors='ignore'
)
df


|   | name | age |
|---|---|---|
| 0 | Dimitar Dimitrov | 30 |
| 1 | Stefan Ivanov | 26 |
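
With more than one person in the input, `record_path` alone stacks all friends into one table and drops the link to their parent record. The sketch below illustrates this with a hypothetical second person (Maria Georgieva's friend is invented for the example):

```python
import pandas as pd

# Two people, each with a nested list of friends (second entry is made up).
people = [
    {
        "name": "Ivan Petrov",
        "info": {"friends": [
            {"name": "Dimitar Dimitrov", "age": 30},
            {"name": "Stefan Ivanov", "age": 26},
        ]},
    },
    {
        "name": "Maria Georgieva",
        "info": {"friends": [{"name": "Elena Koleva", "age": 31}]},
    },
]

# Only the friend records are kept; nothing indicates whose friends they are.
friends = pd.json_normalize(people, record_path=["info", "friends"])
print(friends)
```

The result has one row per friend and only the `name` and `age` columns of the friend records, which is why parent fields usually need to be carried along as metadata.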


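### Example 3: Handling Nested Structures (using record_path, meta and prefixes)

The `meta` parameter adds fields of the parent record to every expanded row, while `record_prefix` and `meta_prefix` keep the two groups of columns apart: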

In [145]:
data = [
    {
        "id": 1,
        "name": "Ivan Petrov",
        "info": {
            "age": 28,
            "city": "Sofia",
            "friends": [
                {"name": "Dimitar Dimitrov", "age": 30},
                {"name": "Stefan Ivanov", "age": 26}
            ]
        }
    }
]

df = pd.json_normalize(
    data,
    record_path=["info", "friends"],  # Path to the nested list to be expanded.
    meta=[
        "name",  # Top-level field to include as metadata.
        ["info", "age"],  # Nested field to include as metadata, specified with its path.
    ],
    record_prefix='friend_',  # Prefix for the columns resulting from the record_path's list.
    meta_prefix="person_",  # Prefix for the columns resulting from the metadata to clearly differentiate them.
    sep='_',  # Separator used between nested names and prefixes, ensuring clarity in the resulting column names.
    errors='ignore'  # If any specified meta path is not found in a record, the function will ignore this and proceed without an error.
)

# Manually rename and reorder the columns:
columns_ordered = ['person_name', 'person_age', 'friend_name', 'friend_age']
# Rename person_info_age:
df = df.rename(columns={
    'person_info_age': 'person_age'
})

# Now reorder the columns:
df = df[columns_ordered]

df


|   | person_name | person_age | friend_name | friend_age |
|---|---|---|---|---|
| 0 | Ivan Petrov | 28 | Dimitar Dimitrov | 30 |
| 1 | Ivan Petrov | 28 | Stefan Ivanov | 26 |
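
The prefixes are not only cosmetic. In this data both the person and the friends have a `name` key, and without a distinguishing prefix pandas refuses to build the table. The following sketch (variable names are illustrative) shows the conflict and one way to resolve it:

```python
import pandas as pd

data = [
    {
        "name": "Ivan Petrov",
        "info": {
            "age": 28,
            "friends": [{"name": "Dimitar Dimitrov", "age": 30}],
        },
    }
]

# Without prefixes, the meta field "name" collides with the record field "name".
try:
    pd.json_normalize(data, record_path=["info", "friends"], meta=["name"])
    conflict = False
except ValueError:
    conflict = True

# A meta_prefix alone is enough to make the column names unique.
df_prefixed = pd.json_normalize(
    data,
    record_path=["info", "friends"],
    meta=["name"],
    meta_prefix="person_",
)
print(conflict)
print(df_prefixed.columns.tolist())
```

The record columns come first and the metadata columns are appended after them, which is why the example above reorders the columns by hand.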


The manual approach (using list comprehensions) is more flexible: 
we explicitly specify which data to extract and the order of the columns in the resulting table:

In [146]:
data = [
    {
        "id": 1,
        "name": "Ivan Petrov",
        "info": {
            "age": 28,
            "city": "Sofia",
            "friends": [
                {"name": "Dimitar Dimitrov", "age": 30},
                {"name": "Stefan Ivanov", "age": 26}
            ]
        }
    }
]

rows = [
    [row['name'], row['info']['age'], friend['name'], friend['age']]
    for row in data
    for friend in row['info']['friends']
]
columns = ['person_name', 'person_age', 'friend_name', 'friend_age']

df = pd.DataFrame(rows, columns=columns)
df

|   | person_name | person_age | friend_name | friend_age |
|---|---|---|---|---|
| 0 | Ivan Petrov | 28 | Dimitar Dimitrov | 30 |
| 1 | Ivan Petrov | 28 | Stefan Ivanov | 26 |
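
That flexibility also covers incomplete records. As a sketch under the assumption that some people may lack an age or a friends list (the second record below is invented for the example), `dict.get()` with defaults keeps the comprehension from failing and still emits one row for a person without friends:

```python
import pandas as pd

people = [
    {
        "name": "Ivan Petrov",
        "info": {"age": 28, "friends": [{"name": "Dimitar Dimitrov", "age": 30}]},
    },
    # Hypothetical incomplete record: no age and no friends list.
    {"name": "Maria Georgieva", "info": {"city": "Plovdiv"}},
]

rows = [
    [
        person.get("name"),
        person.get("info", {}).get("age"),
        friend.get("name"),
        friend.get("age"),
    ]
    for person in people
    # Fall back to a single empty friend so the person still gets a row.
    for friend in (person.get("info", {}).get("friends") or [{}])
]
columns = ["person_name", "person_age", "friend_name", "friend_age"]

df_safe = pd.DataFrame(rows, columns=columns)
print(df_safe)
```

Missing values appear as `None` or NaN in the resulting table, so they can be handled afterwards with the usual pandas tools such as `fillna()` or `dropna()`.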
