****
## JSON exercise

Using the data in the file 'data/world_bank_projects.json' and the techniques demonstrated above:
1. Find the 10 countries with the most projects.
2. Find the top 10 major project themes (using the column 'mjtheme_namecode').
3. In step 2 you will notice that some entries have only the code, with the name missing. Create a dataframe with the missing names filled in.

In [1]:
# Import packages
import pandas as pd
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import numpy as np

In [2]:
# Load data from JSON file into a list of project records
with open('data/world_bank_projects.json') as f:
    projects = json.load(f)

In [3]:
# Create dataframe from list
projects_df = pd.DataFrame(projects)

In [4]:
# Find the 10 countries with the most projects
projects_df.countryshortname.value_counts().head(10)

Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Morocco               12
Bangladesh            12
Africa                11
Mozambique            11
Name: countryshortname, dtype: int64

In [5]:
# Find the top 10 major project themes (counted by theme code)
# json_normalize already returns a DataFrame with one row per theme entry
themes_df = json_normalize(projects, 'mjtheme_namecode')
themes_df.code.value_counts().head(10)

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64
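
For reference, the `json_normalize(projects, 'mjtheme_namecode')` call in In [5] returns one row per theme entry rather than one row per project. A minimal sketch on two made-up records shows the shape of the result. The `toy_projects` data is hypothetical and not taken from the file, and the sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`:

```python
import pandas as pd

# Hypothetical records that mimic the nested 'mjtheme_namecode' field
toy_projects = [
    {"countryshortname": "A", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""},
    ]},
    {"countryshortname": "B", "mjtheme_namecode": [
        {"code": "11", "name": "Environment and natural resources management"},
    ]},
]

# record_path names the nested list to unpack: one output row per inner dict
toy_themes = pd.json_normalize(toy_projects, record_path="mjtheme_namecode")

print(toy_themes)
print(toy_themes["code"].value_counts())
```

Two projects yield three rows here, because the first project carries two theme entries.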

In [6]:
# Value counts of theme names before filling empty name values
themes_df.name.value_counts()

Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
Economic management                              33
Rule of law                                      12
Name: name, dtype: int64

In [7]:
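# Fill empty name values in project themes dataframe
# Sorting puts each code's empty names first, so a backward fill
# copies the name from the next row with the same code
themes_df = themes_df.sort_values(['code','name'])
themes_df = themes_df.replace(r'^\s*$', np.nan, regex=True)
themes_df = themes_df.bfill()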
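
The sort-and-backfill approach in In [7] depends on every code having at least one row with a non-empty name. One possible alternative, sketched here on hypothetical rows (`toy_themes`, `code_to_name`, and `filled` are illustrative names, not part of the original notebook), builds an explicit code-to-name lookup and maps each code through it:

```python
import pandas as pd

# Hypothetical theme rows; the empty strings stand in for missing names
toy_themes = pd.DataFrame({
    "code": ["8", "11", "11", "8", "1"],
    "name": [
        "Human development",
        "",
        "Environment and natural resources management",
        "",
        "Economic management",
    ],
})

# Build a code -> name lookup from the rows that do have a name
named = toy_themes[toy_themes["name"].str.strip() != ""]
code_to_name = named.drop_duplicates("code").set_index("code")["name"]

# Map every row's code through the lookup; row order is left unchanged
filled = toy_themes.assign(name=toy_themes["code"].map(code_to_name))

print(filled)
```

With this variant the rows keep their original order, and a code that never appears with a name ends up as NaN instead of borrowing the name of the next code in sort order.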

In [8]:
# Value counts of theme codes after filling empty name values
themes_df.code.value_counts()

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
3      15
Name: code, dtype: int64

In [9]:
# Value counts of theme names after filling empty name values
themes_df.name.value_counts()

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Rule of law                                      15
Name: name, dtype: int64
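
The code counts in In [8] and the name counts in In [9] line up pairwise, which suggests the fill is consistent. A more direct check, sketched here on hypothetical rows (`toy_filled` is illustrative data, not output from the notebook), is to confirm that every code maps to exactly one name and then count the code and name pairs together:

```python
import pandas as pd

# Hypothetical theme rows after the missing names have been filled in
toy_filled = pd.DataFrame({
    "code": ["8", "11", "11", "8", "1"],
    "name": [
        "Human development",
        "Environment and natural resources management",
        "Environment and natural resources management",
        "Human development",
        "Economic management",
    ],
})

# Number of distinct names per code; a consistent fill gives 1 everywhere
names_per_code = toy_filled.groupby("code")["name"].nunique()
is_consistent = bool((names_per_code == 1).all())
print(is_consistent)

# Count each (code, name) pair, largest first
pair_counts = (
    toy_filled.groupby(["code", "name"]).size().sort_values(ascending=False)
)
print(pair_counts.head(10))
```

Applied to the real `themes_df`, the same two steps would report the top themes with code and name side by side.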