<font style='font-size:1.5em'>**🧑‍🏫 Week 07 Lecture</font><br>
<font style='font-size:1.3em;color:#888888'>Normalising JSON + the Groupby -> Apply -> Combine Strategy** </font>

<font style='font-size:1.2em'>LSE [DS105A](https://lse-dsi.github.io/DS105/autumn-term/index.html){style="color:#e26a4f;font-weight:bold"} – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 14 November 2024 

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io){style="color:#e26a4f;font-weight:bold"}

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi){style="color:#e26a4f;font-weight:bold"}

**OBJECTIVE**: Demonstrate how to 'disentangle' complex JSON data structures using the `json_normalize` function from the `pandas` library and introduce the `groupby -> apply -> combine` strategy to process data in a more efficient way than using loops. We will also discuss the `explode` function to handle cases when we find ourselves with columns made out of lists.

**REFERENCES:**

- The [`pd.json_normalize()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to convert JSON data more easily into tabular format

- The [DataFrame.explode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to handle cases when columns are made out of lists

In the labs later (second notebook), we will also cover:

- The [DataFrame.groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) function, combined with apply() and agg() to aggregate data 

---

In [1]:
import pandas as pd

# 1. Flat vs Nested JSON

You spent the past few weeks playing with data collected from OpenMeteo, a (mostly) free API that provides weather data in JSON format for any location in the world. JSON is indeed the preferred format for APIs: it is easy for both humans and machines to read and write, and easy to parse in any programming language.

👉 However, most data analysis tools, such as `pandas` in Python, `dplyr` in R, or SQL, as well as most visualisation libraries, are designed to work with tabular data. This means that we need to convert JSON data into a tabular format to be able to analyse it.

OpenMeteo's JSON is overall fairly straightforward. If you just used a single location and a single temporal resolution (either `daily` or `hourly`), the JSON output was mostly "flat" and could be easily converted into a DataFrame. 

<div style="display: flex; flex-wrap: wrap; flex-direction:row;justify-content: left; margin: 0.5em;font-size:0.9em">

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

**What is considered a "flat" JSON?**

A "flat" JSON is one where the keys are all at the same level, and the values are either atomic (strings, numbers, booleans) or lists of atomic values.

For example:

```json
{
    "key1": "value1",
    "key2": "value2",
    "key3": [1, 2, 3]
}
```

</div>

<div style="color: #333333; background-color:#ffffff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;margin-left:2em">

**What is considered a "nested" JSON?**

A "nested" JSON is one where the keys are at different levels, you have dictionaries within dictionaries, or lists of dictionaries.

For example:

```json
{
    "key1": "value1",
    "key2": {
        "key3": "value3",
        "key4": "value4"
    },
    "key5": [
        {"key6": "value6"},
        {"key7": "value7"}
    ]
}
```

</div>

</div>
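
If you want to check this programmatically, here is a minimal sketch of the two definitions above. The helper name `is_nested` is my own invention for illustration (it is not a pandas or standard-library function), and it only checks the two cases named in the definitions: dictionaries within dictionaries, and lists of dictionaries.

```python
def is_nested(obj: dict) -> bool:
    """Hypothetical helper: True if any value is a dict or a list containing a dict."""
    for value in obj.values():
        if isinstance(value, dict):
            return True
        if isinstance(value, list) and any(isinstance(item, dict) for item in value):
            return True
    return False


# The two examples from the boxes above
flat_example = {"key1": "value1", "key2": "value2", "key3": [1, 2, 3]}
nested_example = {
    "key1": "value1",
    "key2": {"key3": "value3", "key4": "value4"},
    "key5": [{"key6": "value6"}, {"key7": "value7"}],
}

print(is_nested(flat_example))    # False
print(is_nested(nested_example))  # True
```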

[🤔 **Think about it:** What is the easiest way to convert JSON data into a tabular format using Python?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

To answer that question, let's first understand the structure of the JSON data we are working with. I will use OpenMeteo first as an example. 

## 1.1 Let's look at a familiar example

What follows below is the response I get after requesting historical weather data from OpenMeteo using the following parameters:

| Variable             | Value                                                                 |
|----------------------|-----------------------------------------------------------------------|
| **Latitude**             | 51.50853                                                             |
| **Longitude**            | -0.12574                                                            |
| **Period**               | from 26/Oct/2024 until 09/Nov/2024 <br> (when it was super grey in London!)      |
| **Frequency**            | daily                                                                 |
| **Weather variables**    | weather code, daylight duration, sunshine duration                    |

I copied the output and stored it as a dictionary to save us the hassle of sending a request:

In [2]:
json_one_location = {
  "latitude": 51.50853,
  "longitude": -0.12574,
  "generationtime_ms": 0.154972076416016,
  "utc_offset_seconds": 0,
  "timezone": "GMT",
  "timezone_abbreviation": "GMT",
  "elevation": 23,
  "daily_units": {
    "time": "iso8601",
    "weather_code": "wmo code",
    "daylight_duration": "s",
    "sunshine_duration": "s"
  },
  "daily": {
    "time": [
      "2024-10-26",
      "2024-10-27",
      "2024-10-28",
      "2024-10-29",
      "2024-10-30",
      "2024-10-31",
      "2024-11-01",
      "2024-11-02",
      "2024-11-03",
      "2024-11-04",
      "2024-11-05",
      "2024-11-06",
      "2024-11-07",
      "2024-11-08",
      "2024-11-09"
    ],
    "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
    "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
    "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
  }
}

[🤔 **Think about it:** Is this a flat or nested JSON object?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

What about the output of OpenMeteo when I request data for two locations at once?

👇

In [3]:
json_two_locations = [
  {
    "latitude": 51.50853,
    "longitude": -0.12574,
    "generationtime_ms": 0.200033187866211,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51, 3],
      "daylight_duration": [35989.18, 35766.05, 35543.86, 35322.81, 35103.12, 34885.01, 34668.7, 34454.44, 34242.45, 34032.99, 33826.3, 33622.64, 33422.29, 33225.15, 33029.27],
      "sunshine_duration": [14002.7, 31127.43, 5811.8, 10787.3, 19920.14, 14421.7, 14437.86, 0, 1098.29, 8162.58, 17941.69, 536.17, 4472.84, 0, 7173.36]
    }
  },
  {
    "latitude": 48.85341,
    "longitude": 	2.3488,
    "generationtime_ms": 0.160098075866699,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 43,
    "location_id": 1,
    "daily_units": {
      "time": "iso8601",
      "weather_code": "wmo code",
      "daylight_duration": "s",
      "sunshine_duration": "s"
    },
    "daily": {
      "time": [
        "2024-10-26",
        "2024-10-27",
        "2024-10-28",
        "2024-10-29",
        "2024-10-30",
        "2024-10-31",
        "2024-11-01",
        "2024-11-02",
        "2024-11-03",
        "2024-11-04",
        "2024-11-05",
        "2024-11-06",
        "2024-11-07",
        "2024-11-08",
        "2024-11-09"
      ],
      "weather_code": [55, 53, 3, 3, 3, 3, 3, 51, 3, 3, 3, 3, 3, 3, 3],
      "daylight_duration": [36679.21, 36477.63, 36276.98, 36077.46, 35879.25, 35682.56, 35487.58, 35294.54, 35103.64, 34915.11, 34729.17, 34546.06, 34366.02, 34188.94, 34013.09],
      "sunshine_duration": [17721.39, 1329.73, 14961.37, 3869.25, 13851.97, 17390.9, 130.73, 0, 17079.12, 17671.44, 0, 9888.03, 0, 0, 10666.35]
    }
  }
]

👨🏻‍🏫 **TEACHING MOMENT:** Watch me as I demonstrate how to browse through the JSON data on VSCode.

## 1.2 Flat JSON makes for easy conversion to tabular format

I can easily convert a flat dictionary (similarly, a flat JSON object) into a DataFrame using the `pd.DataFrame()` function:

In [4]:
# A flat dictionary
flat_dict = {
    "latitude": 51.493847,
    "longitude": -0.1630249,
    "generationtime_ms": 0.154972076416016,
    "utc_offset_seconds": 0,
    "timezone": "GMT",
    "timezone_abbreviation": "GMT",
    "elevation": 23,
}

As a Series:

In [5]:
pd.Series(flat_dict)

latitude                 51.493847
longitude                -0.163025
generationtime_ms         0.154972
utc_offset_seconds               0
timezone                       GMT
timezone_abbreviation          GMT
elevation                       23
dtype: object

Or as a DataFrame:


In [6]:
# To create a DataFrame with just a single row, you need to specify an index.
pd.DataFrame(flat_dict, index=[0])

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation
0,51.493847,-0.163025,0.154972,0,GMT,GMT,23
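
To see why the index is needed, here is a small sketch using a shortened version of `flat_dict`. When every value is a single (scalar) value, pandas cannot tell how many rows you want and raises a `ValueError`. One alternative, shown below, is to wrap the dictionary in a list, which pandas reads as "one record, one row".

```python
import pandas as pd

# A shortened flat dictionary, for illustration only
flat_dict = {"latitude": 51.493847, "longitude": -0.1630249, "timezone": "GMT"}

# Without an index, pandas does not know how many rows to create
try:
    pd.DataFrame(flat_dict)
except ValueError as error:
    print(f"ValueError: {error}")

# Wrapping the dictionary in a list gives a one-row DataFrame
df_one_row = pd.DataFrame([flat_dict])
print(df_one_row.shape)  # (1, 3)
```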


This also works when the values are lists. Take, for example, the dictionary at `json_one_location['daily']`:

In [7]:
# The dictionary that is contained at the 'daily' key is a dictionary of lists.
# We also consider this a flat dictionary because the lists are all the same length.
flat_dict = json_one_location['daily']

# This time you don't need to specify an index. 
# Pandas will automatically expand (explode) the lists into multiple rows 
# and assign an increasing integer index.
pd.DataFrame(flat_dict)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration
0,2024-10-26,3,35989.18,14002.7
1,2024-10-27,51,35766.05,31127.43
2,2024-10-28,51,35543.86,5811.8
3,2024-10-29,51,35322.81,10787.3
4,2024-10-30,3,35103.12,19920.14
5,2024-10-31,3,34885.01,14421.7
6,2024-11-01,3,34668.7,14437.86
7,2024-11-02,51,34454.44,0.0
8,2024-11-03,3,34242.45,1098.29
9,2024-11-04,3,34032.99,8162.58
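
The "same length" condition in the comment above matters. Here is a small sketch with made-up, shortened data that shows what happens when the lists do not line up: pandas refuses to build the DataFrame.

```python
import pandas as pd

# Shortened, illustrative data (not the full OpenMeteo response)
same_length = {"time": ["2024-10-26", "2024-10-27"], "weather_code": [3, 51]}
different_length = {"time": ["2024-10-26", "2024-10-27"], "weather_code": [3, 51, 51]}

# Lists of equal length become columns with one row per element
print(pd.DataFrame(same_length).shape)  # (2, 2)

# Lists of different lengths cannot be aligned into rows
try:
    pd.DataFrame(different_length)
    lengths_ok = True
except ValueError as error:
    lengths_ok = False
    print(f"ValueError: {error}")
```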


## 1.3 It gets trickier with nested JSON

What if instead of creating a DataFrame just for the `daily` key, I create a DataFrame for the `json_one_location` object?

In [8]:
pd.DataFrame(json_one_location)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily
time,51.50853,-0.12574,0.154972,0,GMT,GMT,23,iso8601,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,51.50853,-0.12574,0.154972,0,GMT,GMT,23,wmo code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,51.50853,-0.12574,0.154972,0,GMT,GMT,23,s,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,51.50853,-0.12574,0.154972,0,GMT,GMT,23,s,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


📢 **IT IS REALLY IMPORTANT TO PAY ATTENTION TO WHAT IS HAPPENING HERE:**

Pandas makes inferences about the structure of the data and tries to create a DataFrame that makes sense:

- The keys that are at the top-level dictionary take priority and become columns:

    - `latitude`

    - `longitude`
    
    - `generationtime_ms`
    
    - etc.

The two top-level keys that are dictionaries (`daily_units` and `daily`) are also converted to columns, but with a twist:

- Each one of these two dictionaries becomes a column in the DataFrame

- **Each one of the individual keys of these dictionaries becomes a row**.

   The keys of these internal dictionaries are moved to the **Index** (`time`, `weather_code`, `daylight_duration`, `sunshine_duration`).

- The values of these dictionaries fill the rows with the matching index label

Notice also this about these two special columns:

- Because each value in the `json_one_location['daily_units']` dictionary is a single primitive value (here, a string), it comes out as an individual value in the DataFrame

- As for the `json_one_location['daily']` dictionary, the values are lists, and they are kept as lists in the DataFrame

Notice yet another important thing:

- The keys of the `json_one_location['daily']` and the `json_one_location['daily_units']` dictionaries are exactly the same four keys:

    - `time`

    - `weather_code`

    - `daylight_duration`

    - `sunshine_duration`

- This means that the values of `json_one_location['daily']` and `json_one_location['daily_units']` are aligned in the DataFrame
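
A quick sketch of what this alignment buys you, using a shortened copy of the response (only two days and two keys, for illustration). Because the inner keys become the index, a single `.loc` lookup returns the unit and the values for the same variable side by side.

```python
import pandas as pd

# A shortened copy of the OpenMeteo response, for illustration only
small_json = {
    "latitude": 51.50853,
    "daily_units": {"time": "iso8601", "sunshine_duration": "s"},
    "daily": {
        "time": ["2024-10-26", "2024-10-27"],
        "sunshine_duration": [14002.7, 31127.43],
    },
}

df_small = pd.DataFrame(small_json)

# One row per inner key: the unit and the list of values share the same index label
row = df_small.loc["sunshine_duration"]
print(row["daily_units"])  # s
print(row["daily"])        # [14002.7, 31127.43]
```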


[🤔 **Now, think about it:** Why would I want the DataFrame in that format? What is it good for?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

It all depends on what you want to do with the data afterwards.

Right now, we have just a single location (London's latitude and longitude), so repeating the location information on every row is redundant and wasteful.

**We do not want wasteful dataframes in this course! We want to live in a world where data is clean, tidy, economical, and efficient!** 

In that case, we can conclude that we just care about the `daily` key and we can discard the rest of the information. That is, `pd.DataFrame(json_one_location['daily'])` was the right way to go anyway.

# 2. Meet the `json_normalize()` function

Pandas has a special function just to deal with nested JSON data: `pd.json_normalize()`. 

As the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html#pandas.json_normalize) puts it, the purpose of this function is:

> Normalize semi-structured JSON data into a flat table.

The function expects a **dict or list of dicts** (unserialised JSON objects) as input and returns a DataFrame.
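
Before applying it to the weather data, here is a minimal sketch on the small nested example from Section 1. It shows the two accepted inputs (a dict and a list of dicts) and two optional parameters of `pd.json_normalize()`, `sep` and `max_level`. The variable names are my own.

```python
import pandas as pd

nested_example = {
    "key1": "value1",
    "key2": {"key3": "value3", "key4": "value4"},
    "key5": [{"key6": "value6"}, {"key7": "value7"}],
}

# A single dict produces a single row; nested dict keys are joined with a dot
df_default = pd.json_normalize(nested_example)
print(sorted(df_default.columns))  # ['key1', 'key2.key3', 'key2.key4', 'key5']

# A list of dicts produces one row per dict
df_two_rows = pd.json_normalize([nested_example, nested_example])
print(len(df_two_rows))  # 2

# `sep` changes the separator used in the flattened column names
df_underscore = pd.json_normalize(nested_example, sep="_")

# `max_level=0` leaves nested dictionaries as they are
df_shallow = pd.json_normalize(nested_example, max_level=0)
```

Note that the list of dictionaries at `key5` is not flattened by default: it stays as a list inside a single cell.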

## 2.1 Normalising the JSON data (one location)

Take a look at what happens when we apply the `pd.json_normalize()` function to the `json_one_location` object:

In [9]:
pd.json_normalize(json_one_location)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units.time,daily_units.weather_code,daily_units.daylight_duration,daily_units.sunshine_duration,daily.time,daily.weather_code,daily.daylight_duration,daily.sunshine_duration
0,51.50853,-0.12574,0.154972,0,GMT,GMT,23,iso8601,wmo code,s,s,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2...","[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51...","[35989.18, 35766.05, 35543.86, 35322.81, 35103...","[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


Compare that once again to the DataFrame we got from `pd.DataFrame()`:

In [10]:
pd.DataFrame(json_one_location)

Unnamed: 0,latitude,longitude,generationtime_ms,utc_offset_seconds,timezone,timezone_abbreviation,elevation,daily_units,daily
time,51.50853,-0.12574,0.154972,0,GMT,GMT,23,iso8601,"[2024-10-26, 2024-10-27, 2024-10-28, 2024-10-2..."
weather_code,51.50853,-0.12574,0.154972,0,GMT,GMT,23,wmo code,"[3, 51, 51, 51, 3, 3, 3, 51, 3, 3, 3, 3, 3, 51..."
daylight_duration,51.50853,-0.12574,0.154972,0,GMT,GMT,23,s,"[35989.18, 35766.05, 35543.86, 35322.81, 35103..."
sunshine_duration,51.50853,-0.12574,0.154972,0,GMT,GMT,23,s,"[14002.7, 31127.43, 5811.8, 10787.3, 19920.14,..."


**What is different?**

Instead of converting the keys of the `daily` dictionary into rows (index labels), the `pd.json_normalize()` function has created a separate column for each key of the `daily` dictionary.
    
The new columns are named:
- `daily.time`,

- `daily.weather_code`,

- `daily.daylight_duration`, and

- `daily.sunshine_duration`.

Notice how each column name joins the top-level key (`daily`) and the key of the nested dictionary with a dot (`.`).

The same thing happens with the `daily_units` dictionary:

- `daily_units.time`,

- `daily_units.weather_code`,

- `daily_units.daylight_duration`, and

- `daily_units.sunshine_duration`.

All the rest of the columns are kept as they were (top-level keys of the `json_one_location` dictionary).

 [💡 **TIP:** To use a more technical vocabulary, the `json_normalize()` function produces a DataFrame that is wider (more focused on columns) whereas the `pd.DataFrame()` approach produces a longer DataFrame (where the focus is on the rows)]{style="display:block;color: #333333; background-color:rgba(93, 158, 188, 0.075);border-radius: 10px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);padding:1em;font-size:0.9em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;max-width:450px"}

### Subsetting and exploding again all in a single ⭐️ method chain ⭐️

Now, if I want to explode the lists in the DataFrame, I can do so by using the `explode()` method:

In [11]:
(
    pd.json_normalize(json_one_location)
    # I don't care about the Daily Units (for now at least)
    .drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

    # I would rather have the lists expanded into rows
    .explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # There are these other useless columns that I don't care about
    .drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation'])
)

Unnamed: 0,latitude,longitude,time,weather_code,daylight_duration,sunshine_duration
0,51.50853,-0.12574,2024-10-26,3,35989.18,14002.7
0,51.50853,-0.12574,2024-10-27,51,35766.05,31127.43
0,51.50853,-0.12574,2024-10-28,51,35543.86,5811.8
0,51.50853,-0.12574,2024-10-29,51,35322.81,10787.3
0,51.50853,-0.12574,2024-10-30,3,35103.12,19920.14
0,51.50853,-0.12574,2024-10-31,3,34885.01,14421.7
0,51.50853,-0.12574,2024-11-01,3,34668.7,14437.86
0,51.50853,-0.12574,2024-11-02,51,34454.44,0.0
0,51.50853,-0.12574,2024-11-03,3,34242.45,1098.29
0,51.50853,-0.12574,2024-11-04,3,34032.99,8162.58


<details><summary>Click here to see a different way to do the same thing without method chaining</summary>

If you are not a fan of method chaining, you can also do the following:

```python
df = pd.json_normalize(json_one_location)

# I don't care about the Daily Units (for now at least)
df = df.drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

# I would rather have the lists expanded into rows
df = df.explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

# I don't like the 'daily.' prefix
df = df.rename(columns={
    'daily.time': 'time',
    'daily.weather_code': 'weather_code',
    'daily.daylight_duration': 'daylight_duration',
    'daily.sunshine_duration': 'sunshine_duration'
})

# There are these other useless columns that I don't care about
df = df.drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation'])

```

The advantage of method chaining is that it is more concise, you don't need to keep track of intermediate DataFrames, and, I'd argue, it is also easier to read if well formatted. I expand on this at the end of the notebook.

</details>
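
Typing every column name by hand gets tedious when there are many weather variables. Here is a sketch of one possible alternative that selects columns by their prefix instead. It uses a shortened copy of the response, and `strip_daily_prefix` is a hypothetical helper of my own, not a pandas function. It assumes pandas 1.3 or later, where `explode()` accepts a list of columns.

```python
import pandas as pd

# A shortened copy of the OpenMeteo response, for illustration only
small_json = {
    "latitude": 51.50853,
    "longitude": -0.12574,
    "daily_units": {"time": "iso8601", "sunshine_duration": "s"},
    "daily": {
        "time": ["2024-10-26", "2024-10-27"],
        "sunshine_duration": [14002.7, 31127.43],
    },
}


def strip_daily_prefix(column_name: str) -> str:
    """Hypothetical helper: 'daily.time' -> 'time'; other names are unchanged."""
    prefix = "daily."
    return column_name[len(prefix):] if column_name.startswith(prefix) else column_name


df_tidy = (
    pd.json_normalize(small_json)
    # Keep every column except those that start with 'daily_units.'
    .loc[:, lambda df: ~df.columns.str.startswith("daily_units.")]
    # Explode every column that starts with 'daily.'
    .pipe(lambda df: df.explode([c for c in df.columns if c.startswith("daily.")]))
    # rename() also accepts a function, applied to each column name
    .rename(columns=strip_daily_prefix)
    # Replace the repeated 0 index with 0, 1, 2, ...
    .reset_index(drop=True)
)

print(df_tidy)
```

The trade-off is readability: the explicit lists in the chain above make it obvious which columns are touched, whereas the prefix-based version adapts automatically if you request more weather variables.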

## 2.2 Normalising the JSON data (two locations)

The same code above will work for the `json_two_locations` object. 

The only difference is that OpenMeteo returns an additional key called `location_id` where it specifies if that data sample is for the first or the second location. Weirdly, it only starts counting from the second location, rendering the first location with a missing `location_id` (NaN).

In [13]:
df_sunshine = (
    pd.json_normalize(json_two_locations)
    # I don't care about the Daily Units (for now at least)
    .drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

    # I would rather have the lists expanded into rows
    .explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # There are these other useless columns that I don't care about
    .drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation'])

)

# I could also drop the 'location_id' column, but I will keep it for now.
df_sunshine

Unnamed: 0,latitude,longitude,time,weather_code,daylight_duration,sunshine_duration,location_id
0,51.50853,-0.12574,2024-10-26,3,35989.18,14002.7,
0,51.50853,-0.12574,2024-10-27,51,35766.05,31127.43,
0,51.50853,-0.12574,2024-10-28,51,35543.86,5811.8,
0,51.50853,-0.12574,2024-10-29,51,35322.81,10787.3,
0,51.50853,-0.12574,2024-10-30,3,35103.12,19920.14,
0,51.50853,-0.12574,2024-10-31,3,34885.01,14421.7,
0,51.50853,-0.12574,2024-11-01,3,34668.7,14437.86,
0,51.50853,-0.12574,2024-11-02,51,34454.44,0.0,
0,51.50853,-0.12574,2024-11-03,3,34242.45,1098.29,
0,51.50853,-0.12574,2024-11-04,3,34032.99,8162.58,
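
If you would rather keep `location_id` and fill in the gap, here is a minimal sketch with two shortened records. It rests on an assumption of mine, not something confirmed here: that the missing `location_id` stands for the first location and can be labelled `0`.

```python
import pandas as pd

# Two shortened records: only the second one carries a 'location_id'
small_two_locations = [
    {"latitude": 51.50853, "longitude": -0.12574},
    {"latitude": 48.85341, "longitude": 2.3488, "location_id": 1},
]

df_locations = pd.json_normalize(small_two_locations)
print(df_locations["location_id"].isna().tolist())  # [True, False]

# ASSUMPTION: a missing location_id means "the first location", labelled 0
df_locations = (
    df_locations
    .fillna({"location_id": 0})
    # The NaN forced the column to float; convert it back to integers
    .astype({"location_id": int})
)
print(df_locations["location_id"].tolist())  # [0, 1]
```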


# 3. Merging data

We briefly explored merging in a previous lecture, but let's revisit it here.

[🤔 **Think about it:** Wouldn't it be great if instead of `latitude` and `longitude`, we just had the city name?]{style="display:block;background-color:rgba(93, 158, 188, 0.1);padding:0.5em;font-size:1.05em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;"}

We DO have data on the city names! Let's read that old CSV file we used in the past:

In [14]:
df_world_cities = pd.read_csv('../../W03-Formative-Solutions/data/world_cities.csv')

# Show a sample of the data just to get an idea of what it looks like
df_world_cities.head()

Unnamed: 0,country,name,lat,lng
0,AD,El Tarter,42.57952,1.65362
1,AD,Sant Julià de Lòria,42.46372,1.49129
2,AD,Pas de la Casa,42.54277,1.73361
3,AD,Ordino,42.55623,1.53319
4,AD,les Escaldes,42.50729,1.53414


**We want to merge the `df_sunshine` DataFrame with the `df_world_cities` DataFrame.**

What does that mean? It means that we want to add the `name` and `country` columns from the `df_world_cities` DataFrame to the `df_sunshine` DataFrame.

Then, I'd be able to drop the `latitude` and `longitude` columns that are not very human-readable.

### How to perform a merge

The `pd.merge()` function is the way to go. It expects the following arguments:

- `left`: the left DataFrame

- `right`: the right DataFrame

- `how`: the type of merge you want to perform (inner, outer, left, right)

    - `inner`: keeps only the rows that have a match in both DataFrames

    - `outer`: keeps all rows from both DataFrames, even if they don't have a match

    - `left`: keeps all rows from the left DataFrame, even if they don't have a match with the right DataFrame

    - `right`: keeps all rows from the right DataFrame, even if they don't have a match with the left DataFrame

- `left_on`: the column(s) on the left DataFrame that you want to use to merge

- `right_on`: the column(s) on the right DataFrame that you want to use to merge

It is important to note that the columns used to merge the DataFrames have to represent the same information. In this case, the `latitude` and `longitude` columns from the `df_sunshine` DataFrame represent the same information as the `lat` and `lng` columns from the `df_world_cities` DataFrame.
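
To make the four `how` options concrete, here is a small sketch with two toy DataFrames that share only one key value. All names and numbers are made up for illustration.

```python
import pandas as pd

# Toy data: only the value 2.0 appears on both sides
df_left = pd.DataFrame({"latitude": [1.0, 2.0], "sunshine": [100, 200]})
df_right = pd.DataFrame({"lat": [2.0, 3.0], "name": ["B", "C"]})

# Count how many rows each type of merge produces
row_counts = {
    how: len(
        pd.merge(left=df_left, right=df_right, how=how,
                 left_on="latitude", right_on="lat")
    )
    for how in ["inner", "outer", "left", "right"]
}

print(row_counts)  # {'inner': 1, 'outer': 3, 'left': 2, 'right': 2}
```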

In [15]:
pd.merge(left=df_sunshine, right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])

Unnamed: 0,latitude,longitude,time,weather_code,daylight_duration,sunshine_duration,location_id,country,name,lat,lng
0,51.50853,-0.12574,2024-10-26,3,35989.18,14002.7,,GB,London,51.50853,-0.12574
1,51.50853,-0.12574,2024-10-27,51,35766.05,31127.43,,GB,London,51.50853,-0.12574
2,51.50853,-0.12574,2024-10-28,51,35543.86,5811.8,,GB,London,51.50853,-0.12574
3,51.50853,-0.12574,2024-10-29,51,35322.81,10787.3,,GB,London,51.50853,-0.12574
4,51.50853,-0.12574,2024-10-30,3,35103.12,19920.14,,GB,London,51.50853,-0.12574
5,51.50853,-0.12574,2024-10-31,3,34885.01,14421.7,,GB,London,51.50853,-0.12574
6,51.50853,-0.12574,2024-11-01,3,34668.7,14437.86,,GB,London,51.50853,-0.12574
7,51.50853,-0.12574,2024-11-02,51,34454.44,0.0,,GB,London,51.50853,-0.12574
8,51.50853,-0.12574,2024-11-03,3,34242.45,1098.29,,GB,London,51.50853,-0.12574
9,51.50853,-0.12574,2024-11-04,3,34032.99,8162.58,,GB,London,51.50853,-0.12574


**WHAT HAPPENED THERE?**

We performed a left merge, matching the `latitude` and `longitude` columns of the `df_sunshine` DataFrame against the `lat` and `lng` columns of the `df_world_cities` DataFrame.

This means that:

1. All the rows and columns of the `df_sunshine` DataFrame were kept intact 

2. Pandas identified, for each row, the corresponding row in the `df_world_cities` DataFrame that had the same `lat` and `lng` values

3. Pandas added all the columns from the `df_world_cities` DataFrame to the `df_sunshine` DataFrame, filling in with the values of the corresponding row in the `df_world_cities` DataFrame
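
A left merge keeps rows even when no match is found, so it is worth checking that every row did find its city. Here is a sketch of one way to do that with the `indicator` and `validate` parameters of `pd.merge()`. The toy data is made up: the coordinate pair (10.0, 20.0) is a placeholder that deliberately has no match. Keep in mind that this merge relies on the coordinates being exactly equal on both sides.

```python
import pandas as pd

# Toy data: the last row uses a made-up coordinate with no matching city
df_weather = pd.DataFrame({
    "latitude": [51.50853, 51.50853, 10.0],
    "longitude": [-0.12574, -0.12574, 20.0],
    "sunshine_duration": [14002.7, 31127.43, 0.0],
})
df_cities = pd.DataFrame({
    "name": ["London"], "lat": [51.50853], "lng": [-0.12574],
})

df_checked = pd.merge(
    left=df_weather, right=df_cities, how="left",
    left_on=["latitude", "longitude"], right_on=["lat", "lng"],
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    validate="many_to_one",  # raises an error if a coordinate pair repeats in df_cities
)

# Rows marked 'left_only' found no city and have NaN in the city columns
n_unmatched = (df_checked["_merge"] == "left_only").sum()
print(n_unmatched)  # 1
```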

### How is that useful?

Well, we can now drop all the columns related to the location and keep only the `name` and `country` columns:

In [16]:
(
    pd.merge(left=df_sunshine, right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    # While I'm at it, I might as well drop the location_id column as well
    .drop(columns=['lat', 'lng', 'latitude', 'longitude', 'location_id'])
)

Unnamed: 0,time,weather_code,daylight_duration,sunshine_duration,country,name
0,2024-10-26,3,35989.18,14002.7,GB,London
1,2024-10-27,51,35766.05,31127.43,GB,London
2,2024-10-28,51,35543.86,5811.8,GB,London
3,2024-10-29,51,35322.81,10787.3,GB,London
4,2024-10-30,3,35103.12,19920.14,GB,London
5,2024-10-31,3,34885.01,14421.7,GB,London
6,2024-11-01,3,34668.7,14437.86,GB,London
7,2024-11-02,51,34454.44,0.0,GB,London
8,2024-11-03,3,34242.45,1098.29,GB,London
9,2024-11-04,3,34032.99,8162.58,GB,London


You might also want to use the `rename()` function to rename the columns to something more meaningful:

In [17]:
(
    pd.merge(left=df_sunshine, right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    # While I'm at it, I might as well drop the location_id column as well
    .drop(columns=['lat', 'lng', 'latitude', 'longitude', 'location_id'])
    .rename(columns={'name': 'city', 'time': 'date'})
    # Reorder the columns
    [['city', 'date', 'weather_code', 'daylight_duration', 'sunshine_duration']]
)

Unnamed: 0,city,date,weather_code,daylight_duration,sunshine_duration
0,London,2024-10-26,3,35989.18,14002.7
1,London,2024-10-27,51,35766.05,31127.43
2,London,2024-10-28,51,35543.86,5811.8
3,London,2024-10-29,51,35322.81,10787.3
4,London,2024-10-30,3,35103.12,19920.14
5,London,2024-10-31,3,34885.01,14421.7
6,London,2024-11-01,3,34668.7,14437.86
7,London,2024-11-02,51,34454.44,0.0
8,London,2024-11-03,3,34242.45,1098.29
9,London,2024-11-04,3,34032.99,8162.58


# 4. A huge method chain to process the data from OpenMeteo

I've done a lot of repetition above with the intention to demonstrate the different ways you can manipulate JSON data.

Ultimately, though, the philosophy of this course is to be as efficient as possible and to avoid creating as many intermediate DataFrames as I did above.

Let me show you how a huge method chain can be used to process the data from OpenMeteo in a single (huge) command:

In [18]:
df_sunshine = (
    pd.json_normalize(json_two_locations)
    # I don't care about the Daily Units (for now at least)
    .drop(columns=['daily_units.time', 'daily_units.weather_code', 'daily_units.daylight_duration', 'daily_units.sunshine_duration'])

    # I would rather have the lists expanded into rows
    .explode(['daily.time', 'daily.weather_code', 'daily.daylight_duration', 'daily.sunshine_duration'])

    # I don't like the 'daily.' prefix
    .rename(columns={
        'daily.time': 'time',
        'daily.weather_code': 'weather_code',
        'daily.daylight_duration': 'daylight_duration',
        'daily.sunshine_duration': 'sunshine_duration'
    })

    # There are these other useless columns that I don't care about
    .drop(columns=['generationtime_ms', 'utc_offset_seconds', 'timezone', 'timezone_abbreviation', 'elevation', 'location_id'])

    .merge(right=df_world_cities, how='left', left_on=['latitude', 'longitude'], right_on=['lat', 'lng'])
    .drop(columns=['lat', 'lng', 'latitude', 'longitude'])
    .rename(columns={'name': 'city', 'time': 'date'})
    # Reorder the columns
    [['city', 'date', 'weather_code', 'daylight_duration', 'sunshine_duration']]
)

# This time, the 'location_id' column was already dropped inside the chain.
df_sunshine

Unnamed: 0,city,date,weather_code,daylight_duration,sunshine_duration
0,London,2024-10-26,3,35989.18,14002.7
1,London,2024-10-27,51,35766.05,31127.43
2,London,2024-10-28,51,35543.86,5811.8
3,London,2024-10-29,51,35322.81,10787.3
4,London,2024-10-30,3,35103.12,19920.14
5,London,2024-10-31,3,34885.01,14421.7
6,London,2024-11-01,3,34668.7,14437.86
7,London,2024-11-02,51,34454.44,0.0
8,London,2024-11-03,3,34242.45,1098.29
9,London,2024-11-04,3,34032.99,8162.58


**Why is this preferred?** 

A little bit is down to taste. You might just think that this is too complex and that you'd rather have the intermediate DataFrames to check if everything is going as expected. I get that. I also use intermediate DataFrames when I'm building the solution, but at the end once I'm happy with the steps I've taken, I like to chain everything together.

 [💡 **TIP:** There is also a very practical reason why I'd like you to practice this approach to writing code. Chaining operations is the default way of working with data in other data analysis libraries (dplyr in R, SQL, etc.). If you master this skill in Python, you will be able to apply it to any other data manipulation tool you come across in the future. ]{style="display:block;color: #333333; background-color:rgba(93, 158, 188, 0.075);border-radius: 10px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);padding:1em;font-size:0.9em;margin-left:0em;margin-bottom:1em;border-radius:0.5em;max-width:500px"}

**NOW WHAT?**

1. Read the [appendix notebook](./LSE_DS105A_2024_W07b_lecture_appendix.ipynb) to gain a deeper insight into the process of normalising JSON data. This notebook should help you appreciate the power of the `json_normalize()` function.

2. Tomorrow in the labs (15 November 2024), you will be given a different dataset to work with, and you will have to apply the concepts you learned today to convert the JSON data into a tabular format. **Keep your notes about json_normalize() handy!**

3. There will also be a moment where your class teachers will tell you about the `groupby -> apply -> combine` strategy, which lets you summarise and prepare data for plots more efficiently than using loops.
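
As a small preview of that strategy, here is a sketch that compares a loop with a `groupby` on a toy DataFrame. The toy data reuses the first three `sunshine_duration` values for each location from the JSON above; the city labels and the column names `total_sunshine` and `n_days` are my own choices, and the labs may well approach this differently.

```python
import pandas as pd

df_example = pd.DataFrame({
    "city": ["London", "London", "London", "Paris", "Paris", "Paris"],
    "date": ["2024-10-26", "2024-10-27", "2024-10-28"] * 2,
    "sunshine_duration": [14002.7, 31127.43, 5811.8, 17721.39, 1329.73, 14961.37],
})

# With a loop: filter each city by hand and collect the results
totals_loop = {}
for city in df_example["city"].unique():
    is_city = df_example["city"] == city
    totals_loop[city] = df_example.loc[is_city, "sunshine_duration"].sum()

# With groupby: split by city, apply the aggregations, combine into one table
df_summary = (
    df_example
    .groupby("city")
    .agg(total_sunshine=("sunshine_duration", "sum"),
         n_days=("date", "count"))
    .reset_index()
)

print(df_summary)
```

Both approaches give the same totals, but the `groupby` version returns a tidy DataFrame that is ready for plotting and fits naturally into a method chain.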