<font style='font-size:1.5em'>**🧑‍🏫 Week 07 Lecture</font><br>
<font style='font-size:1.3em;color:#888888'>Normalising JSON + the Groupby -> Apply -> Combine Strategy** </font>

<font style='font-size:1.2em'>LSE [DS105A](https://lse-dsi.github.io/DS105/autumn-term/index.html){style="color:#e26a4f;font-weight:bold"} – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 14 November 2024 

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io){style="color:#e26a4f;font-weight:bold"}

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi){style="color:#e26a4f;font-weight:bold"}

**OBJECTIVE**: Demonstrate how to 'disentangle' complex JSON data structures using the `json_normalize` function from the `pandas` library and introduce the `groupby -> apply -> combine` strategy to process data in a more efficient way than using loops. We will also discuss the `explode` function to handle cases when we find ourselves with columns made out of lists.

**REFERENCES:**

- The [`pd.json_normalize()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to convert JSON data more easily into tabular format
- The [DataFrame.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to handle cases when columns are made out of lists
- The [DataFrame.groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) function, combined with apply() and agg() to aggregate data

---

In [2]:
import pandas as pd

# 1. Flat vs Nested JSON

You spent the past few weeks playing with data collected from OpenMeteo, a (mostly) free API that provides weather data for any location in the world in JSON format. JSON is the preferred format for most APIs because it is easy for both humans and machines to read and write, and it can be parsed in virtually any programming language.

👉 However, most data analysis tools, such as `pandas` in Python, `dplyr` in R, or SQL, as well as most visualisation libraries, are designed to work with tabular data. This means that we need to convert JSON data into a tabular format before we can analyse it.

OpenMeteo's JSON is overall fairly straightforward. If you just used a single location and a single temporal resolution (either `daily` or `hourly`), the JSON output was mostly "flat" and could be easily converted into a DataFrame. 

<div style="display: flex; flex-wrap: wrap; flex-direction:row;justify-content: left; margin: 0.5em;font-size:0.9em">

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

**What is considered a "flat" JSON?**

A "flat" JSON is one where the keys are all at the same level, and the values are either atomic (strings, numbers, booleans) or lists of atomic values.

For example:

```json
{
    "key1": "value1",
    "key2": "value2",
    "key3": [1, 2, 3]
}
```

</div>

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;margin-left:2em">

**What is considered a "nested" JSON?**

A "nested" JSON is one where the keys sit at different levels: you have dictionaries within dictionaries, or lists of dictionaries.

For example:

```json
{
    "key1": "value1",
    "key2": {
        "key3": "value3",
        "key4": "value4"
    },
    "key5": [
        {"key6": "value6"},
        {"key7": "value7"}
    ]
}
```

</div>

</div>
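
The two definitions above can be turned into a quick check. What follows is only a minimal sketch: the helper name `is_flat` is hypothetical (it is not a pandas function), and it simply encodes the rule "every value is atomic or a list of atomic values" for a dictionary that has already been parsed from JSON.

```python
def is_flat(obj: dict) -> bool:
    """Return True if every value is atomic or a list of atomic values."""
    # Types that JSON strings, numbers, booleans and null map to in Python.
    atomic = (str, int, float, bool, type(None))
    for value in obj.values():
        if isinstance(value, atomic):
            continue
        if isinstance(value, list) and all(isinstance(item, atomic) for item in value):
            continue
        # Anything else (a dictionary, or a list holding dictionaries/lists) is nesting.
        return False
    return True


flat_example = {"key1": "value1", "key2": "value2", "key3": [1, 2, 3]}
nested_example = {
    "key1": "value1",
    "key2": {"key3": "value3", "key4": "value4"},
    "key5": [{"key6": "value6"}, {"key7": "value7"}],
}

print(is_flat(flat_example))    # True
print(is_flat(nested_example))  # False
```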

[🤔 **Think about it:** What is the easiest way to convert JSON data into a tabular format using Python?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

To answer that question, let's first understand the structure of the JSON data we are working with. I will use OpenMeteo first as an example. 

## 1.1 Let's look at a familiar example

What follows below is the response I get after requesting historical weather data from OpenMeteo using the following parameters:

| Variable             | Value                                                                 |
|----------------------|-----------------------------------------------------------------------|
| **Latitude**             | 51.493847                                                             |
| **Longitude**            | -0.1630249                                                            |
| **Period**               | from 26/Oct/2024 until 09/Nov/2024 <br> (when it was super grey in London!)      |
| **Frequency**            | daily                                                                 |
| **Weather variables**    | weather code, daylight duration, sunshine duration                    |

I copied the output and stored it as a dictionary to save us the hassle of sending a request:

In [38]:
json_one_location = {
  "latitude": 51.493847,
  "longitude": -0.1630249,
  "generationtime_ms": 0.154972076416016,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "timezone_abbreviation": "GMT",
  "elevation": 23,
  "daily_units": {
    "time": "iso8601",
    "weather_code": "wmo code",
    "daylight_duration": "s",
    "sunshine_duration": "s"
  },
  "daily": {
    "time": [
      "2024-10-26",
      "2024-10-27",
      "2024-10-28",
      "2024-10-29",
      "2024-10-30",
      "2024-10-31",
      "2024-11-01",
      "2024-11-02",
      "2024-11-03",
      "2024-11-04",
      "2024-11-05",
      "2024-11-06",
      "2024-11-07",
      "2024-11-08",
      "2024-11-09"
    ],
    "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
    "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
    "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
  }
}

[🤔 **Think about it:** Is this a flat or nested JSON object?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

What about the output of OpenMeteo when I request data for two locations at once?

👇

In [37]:
json_two_locations = [
  {
    "latitude": 51.493847,
    "longitude": -0.1630249,
    "generationtime_ms": 0.200033187866211,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
      "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
      "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
    }
  },
  {
    "latitude": 48.822495,
    "longitude": 2.2881355,
    "generationtime_ms": 0.160098075866699,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 43,
    "location_id": 1,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3],
      "daylight_duration": [36679.21, 36477.63, 36276.98, 36077.46, 35879.25, 35682.56, 35487.58, 35294.54, 35103.64, 34915.11, 34729.17, 34546.06, 34366.02, 34188.94, 34013.09],
      "sunshine_duration": [17721.39, 1329.73, 14961.37, 3869.25, 13851.97, 17390.9, 130.73, 0, 17079.12, 17671.44, 0, 9888.03, 0, 0, 10666.35]
    }
  }
]

👨🏻‍🏫 **TEACHING MOMENT:** Watch me as I demonstrate how to browse through the JSON data on VSCode.

## 1.2 Flat JSON makes for easy conversion to tabular format

I can easily convert a flat dictionary (similarly, a flat JSON object) into a DataFrame using the `pd.DataFrame()` function:

In [47]:
# A flat dictionary
flat_dict = {
    "latitude": 51.493847,
    "longitude": -0.1630249,
    "generationtime_ms": 0.154972076416016,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
}

As a Series:

In [49]:
pd.Series(flat_dict)

latitude                 51.493847
longitude                -0.163025
generationtime_ms         0.154972
utc_offset_seconds               0
timezone                       GMT
timezone_abbreviation          GMT
elevation                       23
dtype: object

Or as a DataFrame:


In [51]:
# To create a DataFrame with just a single row, you need to specify an index.
pd.DataFrame(flat_dict, index=[0])

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation
0,51.493847,-0.163025,0.154972,0,GMT,GMT,23
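
To see why the index is needed, here is a small sketch using an abridged copy of `flat_dict` (three of its keys only). It shows the error pandas raises when every value is a scalar and no index is given, plus one alternative I sometimes find handy: wrapping the dictionary in a list so that it is treated as a single record.

```python
import pandas as pd

# Abridged copy of flat_dict: three of the original keys.
flat_dict = {"latitude": 51.493847, "timezone": "GMT", "elevation": 23}

# With only scalar values and no index, pandas cannot tell how many rows to create.
try:
    pd.DataFrame(flat_dict)
    raised = False
except ValueError as error:
    raised = True
    print(f"ValueError: {error}")

# Alternative: a list with one dictionary is read as "one record = one row".
one_row = pd.DataFrame([flat_dict])
print(one_row)
```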


This is also true when the values are lists. Take for example the dictionary at `json_one_location['daily']`:

In [53]:
# The dictionary that is contained at the 'daily' key is a dictionary of lists.
# We also consider this a flat dictionary because the lists are all the same length.
flat_dict = json_one_location['daily']

# This time you don't need to specify an index. 
# Pandas will automatically expand (explode) the lists into multiple rows 
# and assign an increasing integer index.
pd.DataFrame(flat_dict)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,2024-10-26,3,35989.18,14002.7
1,2024-10-27,51,35766.05,31127.43
2,2024-10-28,51,35543.86,5811.8
3,2024-10-29,51,35322.81,10787.3
4,2024-10-30,3,35103.12,19920.14
5,2024-10-31,3,34885.01,14421.7
6,2024-11-01,3,34668.7,14437.86
7,2024-11-02,51,34454.44,0.0
8,2024-11-03,3,34242.45,1098.29
9,2024-11-04,3,34032.99,8162.58
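
The comment in the cell above relies on all the lists having the same length. A minimal sketch of what happens otherwise, using a hypothetical `uneven` dictionary (the `weather_code` list is deliberately cut short; this is not real OpenMeteo output):

```python
import pandas as pd

# Hypothetical example: three dates but only two weather codes.
uneven = {
    "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
    "weather_code": [3, 51],
}

failed = False
try:
    pd.DataFrame(uneven)
except ValueError as error:
    # pandas refuses to guess how to line up lists of different lengths.
    failed = True
    print(f"pandas refused to build the DataFrame: {error}")
```

If you ever hit this error with API data, it is usually a sign that the response is not as flat as it looked.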


## 1.3 It gets trickier with nested JSON

What if instead of creating a DataFrame just for the `daily` key, I create a DataFrame for the `json_one_location` object?

In [54]:
pd.DataFrame(json_one_location)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily
time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,51.493847,-0.163025,0.154972,0,GMT,GMT,23,wmo code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,51.493847,-0.163025,0.154972,0,GMT,GMT,23,s,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,51.493847,-0.163025,0.154972,0,GMT,GMT,23,s,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


📢 **IT IS REALLY IMPORTANT TO PAY ATTENTION TO WHAT IS HAPPENING HERE:**

Pandas makes inferences about the structure of the data and tries to create a DataFrame that makes sense:

- The keys that are at the top-level dictionary take priority and become columns:

    - `latitude`

    - `longitude`
    
    - `generationtime_ms`
    
    - etc.

The two top-level keys that are dictionaries (`daily_units` and `daily`) are also converted to columns, but with a twist:

- Each one of these two dictionaries becomes a column in the DataFrame

- **Each one of the individual keys of these dictionaries becomes a row**.

   The keys of these internal dictionaries are moved to the **Index** (`time`, `weather_code`, `daylight_duration`, `sunshine_duration`).

- The values of these dictionaries become the values to put on the rows with the corresponding index levels

Notice also this about these two special columns:

- Because each value in the `json_one_location['daily_units']` dictionary is a single primitive value (string, integer, float), it comes out as an individual value in the DataFrame

- As for the `json_one_location['daily']` dictionary, the values are lists, and they are kept as lists in the DataFrame

Notice yet another important thing:

- The keys of the `json_one_location['daily']` and the `json_one_location['daily_units']` dictionaries are exactly the same four keys:

    - `time`

    - `weather_code`

    - `daylight_duration`

    - `sunshine_duration`

- This means that the values of `json_one_location['daily']` and `json_one_location['daily_units']` are aligned in the DataFrame
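
The behaviour described above can be reproduced on a much smaller object. This is a sketch only: `nested` is an abridged copy of `json_one_location` (one scalar key, two days, two variables), and `df_small` is a name I made up for the illustration.

```python
import pandas as pd

# Abridged copy of json_one_location: one scalar key plus the two nested dictionaries.
nested = {
    "latitude": 51.493847,
    "daily_units": {"time": "iso8601", "weather_code": "wmo code"},
    "daily": {"time": ["2024-10-26", "2024-10-27"], "weather_code": [3, 51]},
}

df_small = pd.DataFrame(nested)

# The keys of the inner dictionaries end up in the Index.
print(df_small.index.tolist())

# The scalar is repeated on every row.
print(df_small["latitude"].tolist())

# Values of 'daily_units' are plain strings; values of 'daily' stay as whole lists.
print(df_small.loc["weather_code", "daily_units"])
print(df_small.loc["weather_code", "daily"])
```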


[🤔 **Now, think about it:** Why would I want the DataFrame in that format? What is it good for?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

It all depends on what you want to do with the data afterwards.

Because right now we just have a single location (London's latitude and longitude), this repetition of the location information is redundant and wasteful.

**We do not want wasteful dataframes in this course! We want to live in a world where data is clean, tidy, economical, and efficient!** 

In that case, we can conclude that we just care about the `daily` key and can discard the rest of the information. That is, `pd.DataFrame(json_one_location['daily'])` was the right way to go anyway.

# 2. Unnesting a nested JSON: an exercise in data manipulation

[🤔 **Think about it:** What would it take to convert the DataFrame above to the same format as the nicely shaped `pd.DataFrame(json_one_location['daily'])`?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

Let me do this step by step:

In [55]:
# Let me store this in a variable so I can reuse it later.
df = pd.DataFrame(json_one_location)

When I look at all of the columns I realise that I just care about the `daily` key.

In [68]:
df.columns

Index(['latitude', 'longitude', 'generationtime_ms', 'utc_offset_seconds',
       'timezone', 'timezone_abbreviation', 'elevation', 'daily_units',
       'daily'],
      dtype='object')

 I can drop the rest of the columns:

## 2.1 Subsetting the DataFrame

**Alternative 01:** Filter the DataFrame so it only contains the columns you want.

In [60]:
selected_columns = ['daily']

df[selected_columns]

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


In [61]:
# This does the same thing as the previous cell, but in a single line.
df[['daily']]

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


**Alternative 02:** Drop the columns you do not want with the `drop()` method.

In [69]:
invalid_columns = [column for column in df.columns if column != 'daily']

df.drop(columns=invalid_columns)

Unnamed: 0,daily
time,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


Of course, **Alternative 01** makes a lot more sense in this case because we can manually specify the columns we want to keep.

💡 In the future you might find yourself in a situation where you have a DataFrame with hundreds of columns and you want to drop all of them except for a few. In that case, you can use the `drop()` method above.

## 2.2 Transposing the DataFrame

Essentially, we want to transform the **rows into columns**. Helpfully, each row already has a good name suitable for a column name.

Whenever you want to do this, you can use the `transpose()` method:

In [70]:
# You can use the transpose method to swap the rows and columns:
df[['daily']].transpose()

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


In [71]:
# Or just use T if you prefer:
df[['daily']].T

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


## 2.3 Exploding lists into separate rows

Now, notice that we ended up with a DataFrame where the columns are made out of lists. You will run into this situation every now and then when working with JSON data.

The `DataFrame.explode()` method is a great way to handle this situation. It expects to receive the column name (as a string) or names (as a list of strings) that you want to explode out into separate rows.

In [72]:
selected_columns = ['daily']
columns_to_explode = ['time', 'weather_code', 'daylight_duration', 'sunshine_duration']

# Explode the columns that contain lists:
df[selected_columns].T.explode(columns_to_explode)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
daily,2024-10-26,3,35989.18,14002.7
daily,2024-10-27,51,35766.05,31127.43
daily,2024-10-28,51,35543.86,5811.8
daily,2024-10-29,51,35322.81,10787.3
daily,2024-10-30,3,35103.12,19920.14
daily,2024-10-31,3,34885.01,14421.7
daily,2024-11-01,3,34668.7,14437.86
daily,2024-11-02,51,34454.44,0.0
daily,2024-11-03,3,34242.45,1098.29
daily,2024-11-04,3,34032.99,8162.58


And that's it! We arrived at the same table as the one we got from `pd.DataFrame(json_one_location['daily'])`.
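
One caveat on the exploded result: it keeps the repeated row label in the index, and the pandas documentation for `explode` notes that the exploded columns come back with `object` dtype. Below is a minimal sketch of how one might tidy that up. It assumes pandas 1.3 or later (needed to explode several columns at once), uses an abridged three-day copy of the `daily` data, and builds a hypothetical `wide` frame that stands in for `df[['daily']].T`.

```python
import pandas as pd

# Abridged copy of the first three days of json_one_location['daily'].
daily = {
    "time": ["2024-10-26", "2024-10-27", "2024-10-28"],
    "weather_code": [3, 51, 51],
    "sunshine_duration": [14002.7, 31127.43, 5811.8],
}

# Hypothetical stand-in for df[['daily']].T: a single row whose cells are lists.
wide = pd.DataFrame([daily])

exploded = wide.explode(list(wide.columns))
print(exploded.index.tolist())  # the single row label is repeated on every row

tidy = (
    exploded
        .reset_index(drop=True)  # replace the repeated label with 0, 1, 2, ...
        .astype({"weather_code": "int64", "sunshine_duration": "float64"})
)
tidy["time"] = pd.to_datetime(tidy["time"])  # parse the ISO 8601 strings as dates
print(tidy.dtypes)
```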

Take-home message:

- **Always be mindful of the structure of your data**

- If you have a flat JSON, you can easily convert it into a DataFrame, no need to worry about anything else.

- If you have a slightly nested JSON, you can still convert it into a DataFrame, but you might need to do some extra work to make it more efficient.

# 3. Normalising more complex JSON data

Now, let me complicate things a bit further. Sometimes, you will find yourself with JSON data that is more complex and has more than one level of nesting.

To demonstrate this, I will use the `json_two_locations` object, which contains the weather data for two locations at once.

[🤔 **Think about it:** What would you expect to happen when you convert the `json_two_locations` object into a DataFrame?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"} 

In [74]:
# Your regular reminder that parsed JSON can be either a dictionary or a list.
type(json_two_locations)

list

What happens when I try to create a DataFrame for the entire JSON response?

In [75]:
pd.DataFrame(json_two_locations)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily,location_id
0,51.493847,-0.163025,0.200033,0,GMT,GMT,23,"{'time': 'iso8601', 'weather_code': 'wmo code'...","{'time': ['2024-10-26', '2024-10-27', '2024-10...",
1,48.822495,2.288136,0.160098,0,GMT,GMT,43,"{'time': 'iso8601', 'weather_code': 'wmo code'...","{'time': ['2024-10-26', '2024-10-27', '2024-10...",1.0


⚡ **OH NO!** This time around the code we built above will not work. When I select just the `daily` column, I no longer have those neat index names that I can use as column names:

In [113]:
# If you try to transpose after this line, things will get messy
pd.DataFrame(json_two_locations)[selected_columns]

Unnamed: 0,daily
0,"{'time': ['2024-10-26', '2024-10-27', '2024-10..."
1,"{'time': ['2024-10-26', '2024-10-27', '2024-10..."


## 3.1 The long and winding road (NOT A GOOD PRACTICE -- BUT IT WORKS)

If your Internet is down and you can't look up any documentation, Google or ChatGPT, then you can always take the long way around and use `for` loops to rebuild the DataFrame.

After all, in case of panic, **DataFrames can always be converted to a mix of lists and dictionaries**:

### Converting a DataFrame back to 'pure Python'

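Using the method below, each row becomes a list of its values. Here each row holds a single value (the dictionary from the `daily` column), so we get a list of one-element lists.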

In [124]:
# I will save that to a variable so I can use it later.
output = pd.DataFrame(json_two_locations)[selected_columns].values.tolist()

len(output)

2

In [131]:
print(f"The first element is a {type(output[0])} and the second element is also a {type(output[1])}.")
print(f"The first element has {len(output[0])} element and the second element has {len(output[1])} element.")

The first element is a <class 'list'> and the second element is also a <class 'list'>.
The first element has 1 element and the second element has 1 element.
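
For comparison, here is a sketch of two ways to go back to 'pure Python' on a tiny, hypothetical DataFrame (`small`, built from the two cities and elevations seen earlier). `.values.tolist()` gives one list per row and drops the column names, whereas `to_dict(orient="records")` gives one dictionary per row and keeps them.

```python
import pandas as pd

# Hypothetical two-row DataFrame used only for this illustration.
small = pd.DataFrame({"city": ["London", "Paris"], "elevation": [23, 43]})

# One list per row; the column names are lost.
rows_as_lists = small.values.tolist()
print(rows_as_lists)

# One dictionary per row; the column names become the keys.
rows_as_dicts = small.to_dict(orient="records")
print(rows_as_dicts)
```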


### Rebuilding by concatenating manually

Therefore, I could convert each of these lists into its own DataFrame and concatenate the results into a single DataFrame:

In [132]:
pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])])

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."
0,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3]","[36679.21, 36477.63, 36276.98, 36077.46, 35879...","[17721.39, 1329.73, 14961.37, 3869.25, 13851.9..."


### Exploding so we have a conventional, unnested DataFrame

After which I could try to explode this DataFrame as we did before.

In [133]:
pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])]).explode(columns_to_explode)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,2024-10-26,3,35989.18,14002.7
0,2024-10-27,51,35766.05,31127.43
0,2024-10-28,51,35543.86,5811.8
0,2024-10-29,51,35322.81,10787.3
0,2024-10-30,3,35103.12,19920.14
0,2024-10-31,3,34885.01,14421.7
0,2024-11-01,3,34668.7,14437.86
0,2024-11-02,51,34454.44,0.0
0,2024-11-03,3,34242.45,1098.29
0,2024-11-04,3,34032.99,8162.58


**Success?! Not quite.** As the 'daily' key didn't have any information about the location, we lost that information.

Unlike the previous example, where we had a single location, we now have two locations and we need to keep track of which row belongs to which location.

### Manually adding the location information (DEFINITELY NOT A GOOD PRACTICE!!!)

While the code below works, this is very error prone. It requires a lot of faith on the part of the programmer that the data is correct and that the code is doing what it is supposed to do.

You've been coding for a few weeks now, so you are probably familiar with the concept of 'oops, I forgot to add a comma here' or 'I forgot to add a plus sign there'. Imagine how much harder it is to keep track of all of the different variables and data types.

**AVOID MANUALLY ADDING DATA TO DATAFRAMES AT ALL COSTS!**

In [134]:
df = pd.concat([pd.DataFrame(output[0]), pd.DataFrame(output[1])]).explode(columns_to_explode)

print(f"The DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")

The DataFrame has 30 rows and 4 columns.


I know that the first half of the DataFrame belongs to the first location and the second half belongs to the second location.

I also know (I think?!) that the first location is London and the second location is Paris.

I can manually add this information to the DataFrame:

In [135]:
# This hurts my eyes. Never use hard-coded values like this.
df['location_id'] = ['London'] * 15 + ['Paris'] * 15

In [137]:
# It's a bittersweet victory:
df

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,location_id
0,2024-10-26,3,35989.18,14002.7,London
0,2024-10-27,51,35766.05,31127.43,London
0,2024-10-28,51,35543.86,5811.8,London
0,2024-10-29,51,35322.81,10787.3,London
0,2024-10-30,3,35103.12,19920.14,London
0,2024-10-31,3,34885.01,14421.7,London
0,2024-11-01,3,34668.7,14437.86,London
0,2024-11-02,51,34454.44,0.0,London
0,2024-11-03,3,34242.45,1098.29,London
0,2024-11-04,3,34032.99,8162.58,London
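
If you do end up on the loop-based route, the hard-coded labels above can at least be avoided by reading the location from each record as you go. This is a sketch under stated assumptions: `records` is an abridged stand-in for `json_two_locations` (two days and one weather variable per location), and `frames` and `combined` are names made up for the illustration.

```python
import pandas as pd

# Abridged stand-in for json_two_locations: two days per location.
records = [
    {
        "latitude": 51.493847,
        "longitude": -0.1630249,
        "daily": {"time": ["2024-10-26", "2024-10-27"], "weather_code": [3, 51]},
    },
    {
        "latitude": 48.822495,
        "longitude": 2.2881355,
        "daily": {"time": ["2024-10-26", "2024-10-27"], "weather_code": [55, 53]},
    },
]

frames = []
for record in records:
    # The 'daily' dictionary is flat, so it converts directly.
    frame = pd.DataFrame(record["daily"])
    # Take the location from the same record instead of typing it by hand.
    frame["latitude"] = record["latitude"]
    frame["longitude"] = record["longitude"]
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The coordinates now travel with the data, so nothing depends on my memory of which location came first or how many rows each one has.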


# 4. Meet the `json_normalize()` function
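
Before the full call on the two-location response, here is a minimal sketch of what `pd.json_normalize()` does on its own. It assumes an abridged stand-in for `json_two_locations` (the hypothetical `records` list, with two days and one weather variable per location). Nested dictionaries are flattened into columns whose names join the keys with a dot (the default `sep`), while lists are left untouched, which is why an `explode()` step follows.

```python
import pandas as pd

# Abridged stand-in for json_two_locations.
records = [
    {
        "latitude": 51.493847,
        "daily_units": {"weather_code": "wmo code"},
        "daily": {"time": ["2024-10-26", "2024-10-27"], "weather_code": [3, 51]},
    },
    {
        "latitude": 48.822495,
        "daily_units": {"weather_code": "wmo code"},
        "daily": {"time": ["2024-10-26", "2024-10-27"], "weather_code": [55, 53]},
    },
]

# One row per location; nested keys become dotted column names.
normalised = pd.json_normalize(records)
print(sorted(normalised.columns))

# The list-valued columns still need exploding to get one row per day.
tidy = (
    normalised
        .explode(["daily.time", "daily.weather_code"])
        .reset_index(drop=True)
)
print(tidy)
```

Unlike the manual route, the latitude stays attached to every exploded row, so the location information is never lost.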

In [160]:
(
    pd.json_normalize(json_two_locations)
        .explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])
        .sort_values(['daily.time', 'latitude', 'longitude'])
        .set_index(['daily.time', 'latitude', 'longitude'])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units.time,daily_units.weather_code,daily_units.daylight_duration,daily_units.sunshine_duration,daily.weather_code,daily.daylight_duration,daily.sunshine_duration,location_id
daily.time,latitude,longitude,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2024-10-26,48.822495,2.288136,0.160098,0,GMT,GMT,43,iso8601,wmo code,s,s,55,36679.21,17721.39,1.0
2024-10-26,51.493847,-0.163025,0.200033,0,GMT,GMT,23,iso8601,wmo code,s,s,3,35989.18,14002.7,
2024-10-27,48.822495,2.288136,0.160098,0,GMT,GMT,43,iso8601,wmo code,s,s,53,36477.63,1329.73,1.0
2024-10-27,51.493847,-0.163025,0.200033,0,GMT,GMT,23,iso8601,wmo code,s,s,51,35766.05,31127.43,
2024-10-28,48.822495,2.288136,0.160098,0,GMT,GMT,43,iso8601,wmo code,s,s,3,36276.98,14961.37,1.0
2024-10-28,51.493847,-0.163025,0.200033,0,GMT,GMT,23,iso8601,wmo code,s,s,51,35543.86,5811.8,
2024-10-29,48.822495,2.288136,0.160098,0,GMT,GMT,43,iso8601,wmo code,s,s,3,36077.46,3869.25,1.0
2024-10-29,51.493847,-0.163025,0.200033,0,GMT,GMT,23,iso8601,wmo code,s,s,51,35322.81,10787.3,
2024-10-30,48.822495,2.288136,0.160098,0,GMT,GMT,43,iso8601,wmo code,s,s,3,35879.25,13851.97,1.0
2024-10-30,51.493847,-0.163025,0.200033,0,GMT,GMT,23,iso8601,wmo code,s,s,3,35103.12,19920.14,


In [40]:
(
    pd.DataFrame(json_one_location)
        .explode('daily')
        .reset_index(names=['variable'])
)

Unnamed: 0,variable,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily
0,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-26
1,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-27
2,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-28
3,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-29
4,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-30
5,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-10-31
6,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-11-01
7,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-11-02
8,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-11-03
9,time,51.493847,-0.163025,0.154972,0,GMT,GMT,23,iso8601,2024-11-04
