# World Bank CCDR
Extracting data from the [World Bank](https://www.worldbank.org/)'s [Country Climate and Development Report (CCDR)](https://databank.worldbank.org/source/country-climate-and-development-report-(ccdr))  
The report is an annual time series, per country, of climate and development features.  
The World Bank provides APIs to access the data. More information can be found on the [Developer Information](https://datahelpdesk.worldbank.org/knowledgebase/topics/125589) and [Data Catalog API](https://datahelpdesk.worldbank.org/knowledgebase/articles/1886698-data-catalog-api) pages.


In [None]:
import io
import json
import os
import requests
import zipfile

import pandas as pd

import IPython.display

pd.set_option('display.max_columns', None)

## CCDR dataset ID: '0061107'
Search for the CCDR dataset in the World Bank data catalog.  
Once the dataset is found, read its unique ID.  
There is no need to re-run this section: the unique ID is not expected to change, and we know it is '0061107'.
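
The lookup described above could be wrapped in a small helper, sketched below. This is only an illustration: `extract_unique_id` is a hypothetical name, and the response shape (`Response` → `value` → `dataset_unique_id`) is assumed from the way the search result is accessed in this notebook. The sample response is mock data, not real API output.

```python
def extract_unique_id(search_response: dict) -> str:
    """Return the unique ID of the single dataset in a catalog search response.

    Hypothetical helper: assumes the structure accessed in this notebook,
    i.e. search_response['Response']['value'] is a list of datasets.
    """
    datasets = search_response.get("Response", {}).get("value", [])
    if len(datasets) != 1:
        raise ValueError(f"Expected exactly 1 dataset, found {len(datasets)}")
    return datasets[0]["dataset_unique_id"]


# Mock response mirroring the fields this notebook reads
mock_response = {"Response": {"value": [{"dataset_unique_id": "0061107"}]}}
print(f"ccdr_unique_id: {extract_unique_id(mock_response)}")
```

Raising a `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.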

In [None]:
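search_url = "https://datacatalogapi.worldbank.org/ddhxext/Search?qname=dataset&qterm=ccdr&$filter=(Resources/any(res:res/format+eq+%27API%27))"
r = requests.get(search_url, timeout=30)
r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
rd = r.json()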

In [None]:
# Make sure the query returned only 1 dataset
assert len(rd['Response']['value']) == 1

In [None]:
IPython.display.JSON(rd['Response']['value'][0])

In [None]:
ccdr_unique_id = rd['Response']['value'][0]['dataset_unique_id']
print(f"ccdr_unique_id: {ccdr_unique_id}")

In [None]:
# Check this is the expected ID, i.e. the one this notebook expects
assert ccdr_unique_id == '0061107'

## Dataset metadata
Run the dataset query to retrieve these URLs:
- The [list of indicators](https://api.worldbank.org/v2/sources/87/indicators), i.e. the "columns" and their description
- The [data files (CSV zip)](https://databank.worldbank.org/data/download/CCDR_csv.zip)
- The [data files (Excel zip)](https://databank.worldbank.org/data/download/CCDR_excel.zip)
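
As a rough sketch of how these URLs could be picked out of the dataset's resource list, the helper below scans each resource's `website_url`. The name `find_resource_urls` is hypothetical, and the sample list is mock data built from the three URLs above; the real response may contain more fields or a different number of resources.

```python
def find_resource_urls(resources):
    """Map 'indicators' and 'csv_zip' to the matching resource URLs.

    Hypothetical helper: assumes each resource is a dict that may carry a
    'website_url' entry, possibly empty or padded with whitespace.
    """
    urls = {}
    for resource in resources:
        url = (resource.get("website_url") or "").strip()
        lowered = url.lower()
        if "indicators" in lowered:
            urls["indicators"] = url
        elif "csv" in lowered:
            urls["csv_zip"] = url
    return urls


# Mock resources; note the trailing space on the CSV URL
mock_resources = [
    {"website_url": "https://api.worldbank.org/v2/sources/87/indicators"},
    {"website_url": "https://databank.worldbank.org/data/download/CCDR_csv.zip "},
    {"website_url": "https://databank.worldbank.org/data/download/CCDR_excel.zip"},
    {"website_url": None},
]
urls = find_resource_urls(mock_resources)
print(urls)
```

Returning a dictionary means a missing URL shows up as a missing key rather than as an undefined variable later in the notebook.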

In [None]:
ccdr_unique_id = '0061107'

In [None]:
# Omit the `version_id` query parameter to get the latest version
# Hint: version information is provided in the `maintenance_information` object of a dataset.
preview_query = f"https://datacatalogapi.worldbank.org/ddhxext/DatasetView?dataset_unique_id={ccdr_unique_id}"
r = requests.get(preview_query, timeout=30)
r.raise_for_status()  # fail early on HTTP errors
rd = r.json()

In [None]:
IPython.display.JSON(rd)

In [None]:
# Check info about the last version of this dataset
# At the time of writing:
# {'version': '2',
# 'version_label': 'wdr check',
# 'version_id': '2022-02-14T14:50:48.1328279Z',
# 'uuid': '8d9c3141-1f8d-ec11-93b0-000d3a3b49e6',
# 'version_date': '2022-02-14T14:50:48+00:00'}
IPython.display.JSON(rd['maintenance_information']['version_history'][-1])

In [None]:
resources = rd['Resources']
# Default to None so a missing resource is visible instead of raising a NameError
indicators_url = csv_zip_url = None
for resource in resources:
    # website_url may be missing, and some URLs carry an extra trailing space, hence the strip
    website_url = (resource.get('website_url') or '').strip()
    if 'indicators' in website_url:
        indicators_url = website_url
    elif 'csv' in website_url:
        csv_zip_url = website_url
print(f"Indicators: {indicators_url}")
print(f"csv_zip_url: {csv_zip_url}")

## Download the data
The zip file contains:
- CCDRData.csv
- CCDRCountry.csv
- CCDRSeries.csv
- CCDRCountry-Series.csv
- CCDRSeries-Time.csv
- CCDRFootNote.csv
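
Before extracting, it may be worth confirming that the archive really contains the files listed above. The sketch below uses only the standard library; `missing_members` is a hypothetical helper, and the archive is a tiny in-memory stand-in rather than the real download.

```python
import io
import zipfile

# File names taken from the list above
EXPECTED_FILES = {
    "CCDRData.csv",
    "CCDRCountry.csv",
    "CCDRSeries.csv",
    "CCDRCountry-Series.csv",
    "CCDRSeries-Time.csv",
    "CCDRFootNote.csv",
}


def missing_members(zip_bytes, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return set(expected) - set(archive.namelist())


# Build a small in-memory archive holding only one of the expected files
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("CCDRData.csv", "Country Name,Country Code\n")

demo_missing = missing_members(buffer.getvalue())
print(sorted(demo_missing))
```

With the real download, the same function could be called on `r.content` before `extractall`.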

In [None]:
# Assuming the data hasn't moved, it is available here:
csv_zip_url = "https://databank.worldbank.org/data/download/CCDR_csv.zip"
# In the zip file, the data is in:
csv_filename = 'CCDRData.csv'
# We want to download it to:
data_dir = "data"
worldbank_dir = os.path.join(data_dir, "worldbank")
csv_full_filename = os.path.join(worldbank_dir, csv_filename)
csv_full_filename

In [None]:
# Download and extract the data file

# Set to True if you've never downloaded the data or if you want to overwrite it
download = False
if download:
    r = requests.get(csv_zip_url, timeout=60)
    r.raise_for_status()  # do not try to unzip an error page
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        z.extractall(worldbank_dir)

## Load the dataset

In [None]:
df = pd.read_csv(csv_full_filename)
df.head()

In [None]:
df.shape

## Convert
Convert the dataset to:
- 1 row per country and year
- 1 column per indicator code
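
To make the target shape concrete, here is a toy version of the reshaping on made-up data: two countries, two indicators, two year columns. The country names, indicator codes, and values are all invented for illustration; only the column names follow this notebook.

```python
import pandas as pd

# Toy input: one row per country and indicator, one column per year
toy = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "BBB"],
    "country_name": ["Country A", "Country A", "Country B", "Country B"],
    "indicator_code": ["X_1", "X_2", "X_1", "X_2"],
    "2020": [1.0, 2.0, 3.0, 4.0],
    "2021": [5.0, 6.0, 7.0, 8.0],
})

# Wide to long: one row per country, indicator and year
long_df = toy.melt(
    id_vars=["country_code", "country_name", "indicator_code"],
    var_name="year",
    value_name="value",
)

# Long to wide again, this time with one column per indicator
wide_df = long_df.pivot(
    index=["country_code", "country_name", "year"],
    columns="indicator_code",
    values="value",
).reset_index()
wide_df.columns.name = None  # drop the leftover 'indicator_code' axis label

print(wide_df)
```

Note that `year` holds strings after `melt`, because the values come from column names; convert it with `astype(int)` if numeric years are needed.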

In [None]:
# Drop "future" columns
future_years = list(range(2023, 2051))
future_years.append(2100)
future_years = list(map(str, future_years))

df.drop(future_years, axis=1, inplace=True)
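
The list of future years above is hard-coded. As an alternative sketch, the columns to drop could be derived from the column names themselves, so the cell keeps working if the file gains or loses projection years. `future_year_columns` is a hypothetical helper, and the cutoff of 2022 is an assumption matching the cell above, which drops everything from 2023 onward.

```python
def future_year_columns(columns, last_year=2022):
    """Return the column names that are years strictly after last_year.

    Non-numeric names such as 'Country Name' or 'Unnamed: 96' are ignored.
    """
    return [c for c in columns if str(c).isdigit() and int(c) > last_year]


# Mock column names in the style of CCDRData.csv
sample_columns = ["Country Name", "Country Code", "1960", "2022", "2023", "2050", "2100", "Unnamed: 96"]
print(future_year_columns(sample_columns))
```

The result could then be passed to `df.drop(..., axis=1)` in place of the hand-built list.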

In [None]:
# Drop unused columns
unused_columns = ['Indicator Name','Unnamed: 96']
df.drop(unused_columns, axis=1, inplace=True)

In [None]:
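# Replace dots with underscores in indicator codes so they can be used as column names
# regex=False treats '.' as a literal dot rather than a regex wildcard
df['Indicator Code'] = df['Indicator Code'].str.replace('.', '_', regex=False)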

In [None]:
# Remove spaces from column names
df.rename(columns={'Country Code': 'country_code',
                   'Country Name': 'country_name',
                   'Indicator Code': 'indicator_code'},
          inplace=True)

In [None]:
# Unpivot from wide to long format (one row per country, indicator and year),
# then pivot back to wide with one column per indicator code
df = df.melt(id_vars=['country_code', 'country_name', 'indicator_code'],
             var_name='year',
             value_name='value')
df = df.pivot(index=['country_code', 'country_name', 'year'],
              columns='indicator_code',
              values='value').reset_index()
df.head()

## Save
Save the converted dataset to a CSV file.

In [None]:
output_filename = os.path.join(worldbank_dir, 'ccdr.csv')
df.to_csv(output_filename, encoding='utf-8', index=False)

## Check
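
Besides inspecting one country by eye, a structural check can confirm that the converted table really has one row per country and year. The sketch below runs on made-up data; `duplicated_country_years` is a hypothetical helper, and it assumes the `country_code` and `year` column names used in this notebook.

```python
import pandas as pd


def duplicated_country_years(frame):
    """Count rows that repeat an earlier (country_code, year) pair."""
    return int(frame.duplicated(subset=["country_code", "year"]).sum())


# Invented frames: one clean, one with a repeated (FRA, 2020) row
clean = pd.DataFrame({"country_code": ["FRA", "FRA"], "year": ["2020", "2021"]})
with_repeat = pd.DataFrame({"country_code": ["FRA", "FRA"], "year": ["2020", "2020"]})

print(duplicated_country_years(clean), duplicated_country_years(with_repeat))
```

On the real data, a result of 0 for `duplicated_country_years(df)` would indicate the expected shape.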

In [None]:
fra_df = df[df['country_code'] == 'FRA']

In [None]:
fra_df

In [None]:
# Indicator: GDP (current US$) (NY.GDP.MKTP.CD)
fra_df[['country_code', 'country_name', 'year', 'NY_GDP_MKTP_CD']]