# JSON data wrangling with Python

### World Bank API example

Solving the double/multiple nesting problem

In [122]:
import requests, pandas as pd

In [123]:
url ='https://api.worldbank.org/v2/country/GBR/indicator/SP.POP.TOTL?format=json'
html = requests.get(url)

In [124]:
# Extract the JSON response from our request
json_data = html.json()

In [125]:
type(json_data)

list

Next, let's load this data into a Pandas DataFrame. The easiest format for this is when the data is a list of dictionaries (i.e. the format used to define inline data in a Vega-Lite chart).

Let's view the API response to check its structure. We can open the API URL in a browser, or print `json_data` here.

In [126]:
print(json_data)

[{'page': 1, 'pages': 2, 'per_page': 50, 'total': 64, 'sourceid': '2', 'lastupdated': '2024-10-24'}, [{'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'country': {'id': 'GB', 'value': 'United Kingdom'}, 'countryiso3code': 'GBR', 'date': '2023', 'value': 68350000, 'unit': '', 'obs_status': '', 'decimal': 0}, {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'country': {'id': 'GB', 'value': 'United Kingdom'}, 'countryiso3code': 'GBR', 'date': '2022', 'value': 67791000, 'unit': '', 'obs_status': '', 'decimal': 0}, {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'country': {'id': 'GB', 'value': 'United Kingdom'}, 'countryiso3code': 'GBR', 'date': '2021', 'value': 67026292, 'unit': '', 'obs_status': '', 'decimal': 0}, {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'country': {'id': 'GB', 'value': 'United Kingdom'}, 'countryiso3code': 'GBR', 'date': '2020', 'value': 67081234, 'unit': '', 'obs_status': '', 'decimal': 0}

The first item in the list is some metadata in dictionary format. The second item in the list is a list of dictionaries. This second item has the data we want so let's load just this into a `pandas` dataframe.
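To make that structure concrete, here is a minimal sketch that unpacks the two items. `sample_response` is a hand-trimmed stand-in for `json_data` (only two records and three keys are kept) so that the snippet runs without a network call; the names `metadata`, `records` and `sample_df` are our own.

```python
import pandas as pd

# Hand-trimmed stand-in for json_data: [metadata, list of observations]
sample_response = [
    {'page': 1, 'pages': 2, 'per_page': 50, 'total': 64},
    [
        {'country': {'id': 'GB', 'value': 'United Kingdom'}, 'date': '2023', 'value': 68350000},
        {'country': {'id': 'GB', 'value': 'United Kingdom'}, 'date': '2022', 'value': 67791000},
    ],
]

# Unpack the two-item list into named variables
metadata, records = sample_response

# The metadata tells us whether this response holds every observation
has_more_pages = metadata['pages'] > metadata['page']
print(has_more_pages)

# A list of dictionaries loads straight into a DataFrame
sample_df = pd.DataFrame(records)
print(sample_df.columns.tolist())
```

In the response printed above, `pages` is 2 and `total` is 64 while `per_page` is 50, so a single request returns only part of the series. The World Bank API documentation describes `page` and `per_page` query parameters for requesting the rest.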

In [127]:
# create dataframe directly
df = pd.DataFrame(json_data[1])
df

Unnamed: 0,indicator,country,countryiso3code,date,value,unit,obs_status,decimal
0,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2023,68350000,,,0
1,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2022,67791000,,,0
2,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2021,67026292,,,0
3,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2020,67081234,,,0
4,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2019,66836327,,,0
5,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2018,66460344,,,0
6,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2017,66058859,,,0
7,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2016,65611593,,,0
8,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2015,65116219,,,0
9,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...","{'id': 'GB', 'value': 'United Kingdom'}",GBR,2014,64602298,,,0


However, some of our values are nested dictionaries (JSON objects). We want to extract the values contained within them.

In [128]:
# `country` column values are actually dictionaries, with keys 'id' and 'value'. Extract the 'value' key from each dictionary
df['country'] = df['country'].apply(lambda x: x['value'])       

We use `.apply` to apply a function to each element of a column. A lambda function is a concise way to define a function in a single expression, which is then evaluated for every value. Here, we extract the 'value' key from each dictionary in the 'country' column.
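As an alternative to extracting one nested key at a time, `pd.json_normalize` can flatten every nested dictionary in a single call. This is a sketch of a different technique from the one used in this notebook; `records` below is a hand-trimmed copy of two observations from the response, and the names `flat` and `tidy` are our own.

```python
import pandas as pd

# Two observations copied from the response above (trimmed to a few keys)
records = [
    {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
     'country': {'id': 'GB', 'value': 'United Kingdom'},
     'date': '2023', 'value': 68350000},
    {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
     'country': {'id': 'GB', 'value': 'United Kingdom'},
     'date': '2022', 'value': 67791000},
]

# Nested keys become dotted column names such as 'country.value'
flat = pd.json_normalize(records)
print(sorted(flat.columns))

# Keep and rename only the columns we care about
tidy = flat[['country.value', 'date', 'value']].rename(columns={'country.value': 'country'})
print(tidy)
```

The `.apply` approach is handy when only one nested field is needed; flattening everything is convenient when several nested fields are of interest.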

In [129]:
df.head()

Unnamed: 0,indicator,country,countryiso3code,date,value,unit,obs_status,decimal
0,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...",United Kingdom,GBR,2023,68350000,,,0
1,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...",United Kingdom,GBR,2022,67791000,,,0
2,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...",United Kingdom,GBR,2021,67026292,,,0
3,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...",United Kingdom,GBR,2020,67081234,,,0
4,"{'id': 'SP.POP.TOTL', 'value': 'Population, to...",United Kingdom,GBR,2019,66836327,,,0


We don't need the indicator column, so we can drop this.

In [130]:
df = df.drop(columns=['indicator'])

# We can also drop columns by selecting only the columns to keep (using double brackets)
# df[['country', 'countryiso3code', 'date', 'value']]
df.head()

Unnamed: 0,country,countryiso3code,date,value,unit,obs_status,decimal
0,United Kingdom,GBR,2023,68350000,,,0
1,United Kingdom,GBR,2022,67791000,,,0
2,United Kingdom,GBR,2021,67026292,,,0
3,United Kingdom,GBR,2020,67081234,,,0
4,United Kingdom,GBR,2019,66836327,,,0


Alternatively, we could build lists by looping through the original data, then create a dataframe from those lists.

In [131]:
countries = []
dates = []
values = []
# fill the lists in a loop
for i in json_data[1]:
    # Each item in the json_data[1] list is an observation.
    # For each item 'i', let's append the value we want to each of the lists we created above.
    countries.append(i['country']['value'])
    dates.append(i['date'])
    values.append(i['value'])

# Create a dataframe from the lists. The dictionary keys become the column names, and our new lists become the column values.
df = pd.DataFrame({
    'country': countries,
    'date': dates,
    'value': values
})

df

Unnamed: 0,country,date,value
0,United Kingdom,2023,68350000
1,United Kingdom,2022,67791000
2,United Kingdom,2021,67026292
3,United Kingdom,2020,67081234
4,United Kingdom,2019,66836327
5,United Kingdom,2018,66460344
6,United Kingdom,2017,66058859
7,United Kingdom,2016,65611593
8,United Kingdom,2015,65116219
9,United Kingdom,2014,64602298
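One detail worth checking at this point: in the API response the `date` field arrives as a string (`'2023'`), not a number. The sketch below, using a three-row stand-in named `df_pop` (our own name) with values copied from the table above, converts it to an integer and sorts chronologically.

```python
import pandas as pd

# Stand-in for the dataframe built above (three rows only)
df_pop = pd.DataFrame({
    'country': ['United Kingdom', 'United Kingdom', 'United Kingdom'],
    'date': ['2023', '2022', '2021'],
    'value': [68350000, 67791000, 67026292],
})

# Convert the year strings to integers so they sort and plot as numbers
df_pop['date'] = df_pop['date'].astype(int)

# Sort oldest first and rebuild a clean 0..n-1 index
df_pop = df_pop.sort_values('date').reset_index(drop=True)
print(df_pop)
```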


<br>

--- 

<br>

# Scraping *light*

### Wikipedia example

Reading from `HTML` tables directly with `pandas`.

In [132]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita')

Pandas' `read_html` tries to extract all the tables that it finds on the page. All of these tables are returned as a list of dataframes (in the order they were found). So, to access any of these dataframe tables, we just need to specify the list index.
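Because the position of a table can shift when a page is edited, it can help to inspect the list before picking an index. The sketch below uses `tables_demo`, a hypothetical two-item stand-in for the list returned by `read_html`, so that it runs offline.

```python
import pandas as pd

# Hypothetical stand-ins for the list of dataframes returned by pd.read_html
tables_demo = [
    pd.DataFrame({0: ['legend text']}),
    pd.DataFrame({'Country/Territory': ['Luxembourg *', 'Singapore *'],
                  'Estimate': [143342, 141500]}),
]

# Print each table's position and shape to see what was found
for idx, t in enumerate(tables_demo):
    print(idx, t.shape)

# Pick the first table that contains the column we need
wanted = next(t for t in tables_demo if 'Country/Territory' in t.columns)
print(wanted)
```

`read_html` also accepts a `match` argument that keeps only tables containing the given text, which may be a tidier option when the page has many tables.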

In [133]:
tables[0]

Unnamed: 0,0,1,2
0,">$60,000 $50,000 – $60,000 $40,000 – $50,000 $...","$20,000 – $30,000 $10,000 – $20,000 $5,000 – $...","$1,000 – $2,500 <$1,000 No data"


In [134]:
# Get table 2 (at index 1)
df = tables[1]

df.head()

Unnamed: 0_level_0,Country/Territory,IMF[5][6],IMF[5][6],World Bank[7],World Bank[7],CIA[8][9][10],CIA[8][9][10]
Unnamed: 0_level_1,Country/Territory,Projection,Year,Estimate,Year,Estimate,Year
0,Luxembourg *,151146,2024,143342,2023,115700,2021
1,Singapore *,148186,2024,141500,2023,106000,2021
2,Liechtenstein *,—,—,—,—,139100,2009
3,Macau *,130417,2024,113183,2023,64800,2021
4,Ireland *,127750,2024,127623,2023,102500,2021


This dataframe has multi-indexed columns (i.e. a hierarchy of columns and sub-columns).

Luckily, we can rename them as normal (by providing a list of new column names), and this will remove the multi-index.

In [135]:
df.columns

MultiIndex([('Country/Territory', 'Country/Territory'),
            (        'IMF[5][6]',        'Projection'),
            (        'IMF[5][6]',              'Year'),
            (    'World Bank[7]',          'Estimate'),
            (    'World Bank[7]',              'Year'),
            (    'CIA[8][9][10]',          'Estimate'),
            (    'CIA[8][9][10]',              'Year')],
           )

In [136]:
# Rename columns
df.columns = ['country', 'IMF', 'IMFyear', 'WB', 'WByear', 'CIA', 'CIAyear']
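Typing the names by hand works well for seven columns. For wider tables, a possible alternative is to build the flat names from the two levels. This is a sketch on a one-row stand-in (`demo`, our own name) that reuses three of the column tuples shown above.

```python
import pandas as pd

# A small frame with two-level columns, mirroring part of the table above
columns = pd.MultiIndex.from_tuples([
    ('Country/Territory', 'Country/Territory'),
    ('IMF[5][6]', 'Projection'),
    ('IMF[5][6]', 'Year'),
])
demo = pd.DataFrame([['Luxembourg *', 151146, 2024]], columns=columns)

# Join the two levels with an underscore, skipping levels that just repeat
demo.columns = [
    top if top == sub else f'{top}_{sub}'
    for top, sub in demo.columns
]
print(demo.columns.tolist())
```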

In [137]:
df

Unnamed: 0,country,IMF,IMFyear,WB,WByear,CIA,CIAyear
0,Luxembourg *,151146,2024,143342,2023,115700,2021
1,Singapore *,148186,2024,141500,2023,106000,2021
2,Liechtenstein *,—,—,—,—,139100,2009
3,Macau *,130417,2024,113183,2023,64800,2021
4,Ireland *,127750,2024,127623,2023,102500,2021
...,...,...,...,...,...,...,...
225,Malawi *,1714,2024,1868,2023,1500,2021
226,North Korea *,—,—,—,—,1700,2015
227,Central African Republic *,1296,2024,1130,2023,800,2021
228,Burundi *,986,2024,951,2023,700,2021


`pandas` DataFrame sorting

In [138]:
df.sort_values(by=['country'])      # We can sort 'by' a list of column names if we want to sort by multiple columns.

Unnamed: 0,country,IMF,IMFyear,WB,WByear,CIA,CIAyear
216,Afghanistan *,2116,2022,2093,2022,1500,2021
113,Albania *,21377,2024,21395,2023,14500,2021
126,Algeria *,17718,2024,17027,2023,11000,2021
154,American Samoa *,—,—,—,—,11200,2016
30,Andorra *,68612,2024,71588,2023,49900,2015
...,...,...,...,...,...,...,...
199,Wallis and Futuna *,—,—,—,—,3800,2004
103,World,24567,[i]2024,20946,2022,17000,2021
217,Yemen *,1996,2024,3437,2013,2500,2017
195,Zambia *,4190,2024,4126,2023,3200,2021
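To illustrate sorting by more than one column, here is a small sketch. `demo` holds four rows copied from the table above, and `ranked` is our own name; the `ascending` list sets the direction for each column in turn.

```python
import pandas as pd

# Four rows copied from the table above
demo = pd.DataFrame({
    'country': ['Ireland', 'Macau', 'Liechtenstein', 'Singapore'],
    'CIAyear': [2021, 2021, 2009, 2021],
    'CIA': [102500, 64800, 139100, 106000],
})

# Most recent year first, then the largest estimate first within each year
ranked = demo.sort_values(by=['CIAyear', 'CIA'], ascending=[False, False])
print(ranked)
```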


Let's remove the asterisks from country names.

Using the `.str.replace()` method, we pass the target string and the string we want to replace it with. **Note:** we need the `.str` accessor to use `.replace` (and any other string method) on a dataframe column.

In [139]:
df['country'].str.replace('*', '', regex=False)       # On a column, string methods are accessed through .str; regex=False treats '*' as a literal character.

0                    Luxembourg 
1                     Singapore 
2                 Liechtenstein 
3                         Macau 
4                       Ireland 
                 ...            
225                      Malawi 
226                 North Korea 
227    Central African Republic 
228                     Burundi 
229                 South Sudan 
Name: country, Length: 230, dtype: object
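The output above still shows a trailing space where each asterisk used to be. A minimal sketch of one way to tidy this, assuming the leftover characters are ordinary whitespace, is to chain `.str.strip()`. The Series `names` is a three-item stand-in for the `country` column.

```python
import pandas as pd

# Stand-in for the country column
names = pd.Series(['Luxembourg *', 'Singapore *', 'World'])

# Remove the literal asterisk, then trim leftover whitespace at both ends
cleaned = names.str.replace('*', '', regex=False).str.strip()
print(cleaned.tolist())
```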

In [140]:
# Let's save the steps we tested above.
df = df.sort_values(by='country')
df['country'] = df['country'].str.replace('*', '', regex=False)

In [141]:
df

Unnamed: 0,country,IMF,IMFyear,WB,WByear,CIA,CIAyear
216,Afghanistan,2116,2022,2093,2022,1500,2021
113,Albania,21377,2024,21395,2023,14500,2021
126,Algeria,17718,2024,17027,2023,11000,2021
154,American Samoa,—,—,—,—,11200,2016
30,Andorra,68612,2024,71588,2023,49900,2015
...,...,...,...,...,...,...,...
199,Wallis and Futuna,—,—,—,—,3800,2004
103,World,24567,[i]2024,20946,2022,17000,2021
217,Yemen,1996,2024,3437,2013,2500,2017
195,Zambia,4190,2024,4126,2023,3200,2021


It looks like missing data is denoted by a long dash rather than a number, so we should convert these entries to missing values to ensure each values column is purely numeric.

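One way to do this in `pandas` is using `pd.to_numeric()`. Note that it returns a new Series, so the result only takes effect if we assign it back to the column.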

In [146]:
pd.to_numeric(df['IMF'], errors='coerce')      # 'coerce' will turn any non-numeric values into NaN. When we export a dataframe with NaN values to a CSV, they will be empty cells.

216     2116.0
113    21377.0
126    17718.0
154        NaN
30     68612.0
        ...   
199        NaN
103    24567.0
217     1996.0
195     4190.0
189     5071.0
Name: IMF, Length: 230, dtype: float64
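The call above only displays the converted values. The sketch below assigns the result back for several columns at once and also strips bracketed footnote markers such as the `[i]` seen in the World row. `demo` is a three-row stand-in, and we assume the scraped columns hold strings.

```python
import pandas as pd

# Stand-in rows; '\u2014' is the long dash used for missing values in the table
demo = pd.DataFrame({
    'country': ['Luxembourg', 'Liechtenstein', 'World'],
    'IMF': ['151146', '\u2014', '24567'],
    'IMFyear': ['2024', '\u2014', '[i]2024'],
})

# Strip bracketed footnote markers such as "[i]" before converting
demo['IMFyear'] = demo['IMFyear'].str.replace(r'\[.*?\]', '', regex=True)

# Convert each column and assign it back; non-numeric entries become NaN
for col in ['IMF', 'IMFyear']:
    demo[col] = pd.to_numeric(demo[col], errors='coerce')

print(demo)
```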

In [113]:
df.to_csv('my_data.csv', index=False)

<br>

---

<br>

# Scraping *level 2*

### Compare the market: global car index example

In [52]:
url='https://www.comparethemarket.com/car-insurance/content/global-supercar-index/'

**Note October 2024:** Originally, we targeted `'https://www.comparethemarket.com/car-insurance/content/global-supercar-index/'`, but this URL no longer returns a page. This example has been updated to fetch the webpage from the `Wayback Machine`, a web archive that takes periodic snapshots of pages across the web. We can view the original content at that URL [here](https://web.archive.org/web/20230323232450/https://www.comparethemarket.com/car-insurance/content/global-supercar-index/#expand).

In [27]:
# Define URL to our webpage (through the Wayback Machine)
url = 'https://web.archive.org/web/20230323232450/https://www.comparethemarket.com/car-insurance/content/global-supercar-index/'

We can try to use pandas' built-in `read_html` function like we did above. This looks for any obvious tables in the HTML.

In [74]:
pd.read_html(url)

[Empty DataFrame
 Columns: [Rank, Country, Local price, GBP, EUR, USD]
 Index: []]

This quick method is failing to extract the data, so we'll need to manually parse the HTML and extract the values using BeautifulSoup.

In [26]:
from bs4 import BeautifulSoup

In [69]:
# Make a request to the webpage, get the HTML content, parse it with BeautifulSoup
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')

Now that we're using the Wayback Machine, our page HTML will contain some additional content for the web archive's user interface.

By using `inspect element` on the webpage, we can see that it looks like all the original HTML content of the page is contained within `<div class="full-bleed">`, so let's filter for this.

In [59]:
# Extract just the original content of the page.
page = soup.find('div', class_='full-bleed')
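One thing to keep in mind is that `find` returns `None` when nothing matches, which only surfaces later as an error on the next call. The sketch below shows a simple guard. `html_demo` is a hypothetical, heavily simplified page, and `'no-such-class'` is a made-up class name used to trigger the miss.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page with the wrapper div described above
html_demo = '<html><body><div class="full-bleed"><p>content</p></div></body></html>'
soup_demo = BeautifulSoup(html_demo, 'html.parser')

found = soup_demo.find('div', class_='full-bleed')
missing = soup_demo.find('div', class_='no-such-class')

print(found is not None)
print(missing is None)

# Fall back to the whole document if the wrapper is not present
container = missing if missing is not None else soup_demo
print(container.p.text)
```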

In [60]:
# Extract the table from the original page content
table = page.find('div', id='cheapest-countries')

In [61]:
# Within our table element, extract the headings
headings = table.find_all('h4')

In [62]:
# Within our table element, extract the values
values = table.find_all('p', class_='lead')
# table.find_all('p', attrs={'class':'lead'}) # equivalent to above

In [63]:
# Extract the text from the headings. Uses 'list comprehension' to loop through each element in the list and extract the text.
headings_list = [i.text for i in headings]

In [64]:
# Extract the text from the values. Applies formatting to remove the £ sign and commas, and converts the values to integers.
values_list = [int(i.text.replace('£','').replace(',','')) for i in values]

In [65]:
# What happens if we build a DataFrame directly from headings_list and values_list?
pd.DataFrame([headings_list,values_list])

Unnamed: 0,0,1,2
0,Canada,Mexico,United Kingdom
1,136703,139930,155970


It's in wide format. But we can easily change this by *transposing* the dataframe, using `.T` notation.

In [66]:
# Create a DataFrame from our two lists
df1 = pd.DataFrame([headings_list,values_list]).T
df1

Unnamed: 0,0,1
0,Canada,136703
1,Mexico,139930
2,United Kingdom,155970
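Another way to reach the long format, sketched below, is to pass the two lists as named columns in a dictionary. This skips the transpose, gives the columns readable names, and lets each column keep its own data type. `headings_demo` and `values_demo` are stand-ins holding the values shown above, and the column names are our own choice.

```python
import pandas as pd

# Stand-ins for headings_list and values_list
headings_demo = ['Canada', 'Mexico', 'United Kingdom']
values_demo = [136703, 139930, 155970]

# Transposed version, as above: columns are labelled 0 and 1
transposed = pd.DataFrame([headings_demo, values_demo]).T

# Dictionary version: one named column per list
named = pd.DataFrame({'country': headings_demo, 'price_gbp': values_demo})

# Compare the column types of the two results
print(transposed.dtypes)
print(named.dtypes)
```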


Now let's repeat those steps for the 'expensive countries'.

In [67]:
table = page.find('div', id='expensive-countries')
headings = table.find_all('h4')
values=table.find_all('p',class_='lead')
df2 = pd.DataFrame([[i.text for i in headings],
              [int(i.text.replace('£','').replace(',','')) for i in values]]).T
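Since the same steps are repeated for each section, they could be wrapped in a small function. This is a sketch only: `section_to_df` is a hypothetical helper of our own, and `html_demo` is a simplified stand-in that mirrors the structure described above (a `div` with an id, `h4` headings, and `p` elements with class `lead`).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML mirroring the page structure
html_demo = """
<div id="cheapest-countries">
  <h4>Canada</h4><p class="lead">£136,703</p>
  <h4>Mexico</h4><p class="lead">£139,930</p>
</div>
<div id="expensive-countries">
  <h4>Argentina</h4><p class="lead">£549,840</p>
</div>
"""


def section_to_df(soup, section_id):
    """Return a country/price dataframe for the div with the given id."""
    section = soup.find('div', id=section_id)
    countries = [h.text.strip() for h in section.find_all('h4')]
    # Remove the pound sign and thousands separators before converting
    prices = [int(p.text.replace('£', '').replace(',', ''))
              for p in section.find_all('p', class_='lead')]
    return pd.DataFrame({'country': countries, 'price_gbp': prices})


soup_demo = BeautifulSoup(html_demo, 'html.parser')
cheap = section_to_df(soup_demo, 'cheapest-countries')
expensive = section_to_df(soup_demo, 'expensive-countries')
print(cheap)
print(expensive)
```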

Combine these dataframes using `pd.concat`, which stacks them row-wise and keeps each dataframe's original index.

In [68]:
pd.concat([df1 , df2])

Unnamed: 0,0,1
0,Canada,136703
1,Mexico,139930
2,United Kingdom,155970
0,Argentina,549840
1,Singapore,526958
2,Vietnam,524620
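Notice that the index above repeats (0, 1, 2, 0, 1, 2) and the columns are still labelled 0 and 1. A minimal sketch of one way to tidy both is shown below; `cheap` and `expensive` are stand-ins for `df1` and `df2`, and the column names are our own choice.

```python
import pandas as pd

# Stand-ins for df1 and df2, with the values shown above
cheap = pd.DataFrame({0: ['Canada', 'Mexico', 'United Kingdom'],
                      1: [136703, 139930, 155970]})
expensive = pd.DataFrame({0: ['Argentina', 'Singapore', 'Vietnam'],
                          1: [549840, 526958, 524620]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2
combined = pd.concat([cheap, expensive], ignore_index=True)

# Give the columns descriptive names
combined.columns = ['country', 'price_gbp']
print(combined)
```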
