****
## Springboard Data Science Career Track JSON exercise
Using data in file 'world_bank_projects.json' and the techniques in examples provided:
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

****
## Import modules and load World Bank dataset
Necessary modules include: numpy, pandas, json
The data are in: 'world_bank_projects.json'

In [1]:
# Import numpy as np, pandas as pd, json, and json_normalize
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize  # pandas >= 1.0 also exposes this as pd.json_normalize
# pandas can take semi-structured data (such as a dictionary or a list of dictionaries)
# and 'normalize' it into a flat table
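
To make the idea of 'normalizing' concrete, here is a minimal sketch on two made-up records shaped like the `theme1` field in the World Bank data. It assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`; the record contents and the names `records` and `flat` are hypothetical.

```python
import pandas as pd

# Hypothetical records: each has a nested dict under 'theme1'
records = [
    {"id": "P001", "theme1": {"Percent": 100, "Name": "Education for all"}},
    {"id": "P002", "theme1": {"Percent": 30, "Name": "Climate change"}},
]

# Nested keys become dotted column names in a flat table
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

Each nested key ends up as its own column (`theme1.Percent`, `theme1.Name`), which is what makes the result usable as an ordinary table.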

In [2]:
# Load the World Bank Projects file and view it as a pandas DataFrame
with open('world_bank_projects.json') as f:
    str_data = json.load(f) # parse the JSON file into a list of dicts (one per project)
df1 = pd.DataFrame(str_data) # build a pandas DataFrame from the list of dicts
df1.head(15) # view the first 15 rows

Unnamed: 0,_id,approvalfy,board_approval_month,boardapprovaldate,borrower,closingdate,country_namecode,countrycode,countryname,countryshortname,...,sectorcode,source,status,supplementprojectflg,theme1,theme_namecode,themecode,totalamt,totalcommamt,url
0,{'$oid': '52b213b38594d8a2be17c780'},1999,November,2013-11-12T00:00:00Z,FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA,2018-07-07T00:00:00Z,Federal Democratic Republic of Ethiopia!$!ET,ET,Federal Democratic Republic of Ethiopia,Ethiopia,...,"ET,BS,ES,EP",IBRD,Active,N,"{'Percent': 100, 'Name': 'Education for all'}","[{'code': '65', 'name': 'Education for all'}]",65.0,130000000,130000000,http://www.worldbank.org/projects/P129828/ethi...
1,{'$oid': '52b213b38594d8a2be17c781'},2015,November,2013-11-04T00:00:00Z,GOVERNMENT OF TUNISIA,,Republic of Tunisia!$!TN,TN,Republic of Tunisia,Tunisia,...,"BZ,BS",IBRD,Active,N,"{'Percent': 30, 'Name': 'Other economic manage...","[{'code': '24', 'name': 'Other economic manage...",5424.0,0,4700000,http://www.worldbank.org/projects/P144674?lang=en
2,{'$oid': '52b213b38594d8a2be17c782'},2014,November,2013-11-01T00:00:00Z,MINISTRY OF FINANCE AND ECONOMIC DEVEL,,Tuvalu!$!TV,TV,Tuvalu,Tuvalu,...,TI,IBRD,Active,Y,"{'Percent': 46, 'Name': 'Regional integration'}","[{'code': '47', 'name': 'Regional integration'...",52812547.0,6060000,6060000,http://www.worldbank.org/projects/P145310?lang=en
3,{'$oid': '52b213b38594d8a2be17c783'},2014,October,2013-10-31T00:00:00Z,MIN. OF PLANNING AND INT'L COOPERATION,,Republic of Yemen!$!RY,RY,Republic of Yemen,"Yemen, Republic of",...,JB,IBRD,Active,N,"{'Percent': 50, 'Name': 'Participation and civ...","[{'code': '57', 'name': 'Participation and civ...",5957.0,0,1500000,http://www.worldbank.org/projects/P144665?lang=en
4,{'$oid': '52b213b38594d8a2be17c784'},2014,October,2013-10-31T00:00:00Z,MINISTRY OF FINANCE,2019-04-30T00:00:00Z,Kingdom of Lesotho!$!LS,LS,Kingdom of Lesotho,Lesotho,...,"FH,YW,YZ",IBRD,Active,N,"{'Percent': 30, 'Name': 'Export development an...","[{'code': '45', 'name': 'Export development an...",4145.0,13100000,13100000,http://www.worldbank.org/projects/P144933/seco...
5,{'$oid': '52b213b38594d8a2be17c785'},2014,October,2013-10-31T00:00:00Z,REPUBLIC OF KENYA,,Republic of Kenya!$!KE,KE,Republic of Kenya,Kenya,...,JB,IBRD,Active,Y,"{'Percent': 100, 'Name': 'Social safety nets'}","[{'code': '54', 'name': 'Social safety nets'}]",54.0,10000000,10000000,http://www.worldbank.org/projects/P146161?lang=en
6,{'$oid': '52b213b38594d8a2be17c786'},2014,October,2013-10-29T00:00:00Z,GOVERNMENT OF INDIA,2019-06-30T00:00:00Z,Republic of India!$!IN,IN,Republic of India,India,...,TI,IBRD,Active,N,"{'Percent': 20, 'Name': 'Administrative and ci...","[{'code': '25', 'name': 'Administrative and ci...",3925.0,500000000,500000000,http://www.worldbank.org/projects/P121185/firs...
7,{'$oid': '52b213b38594d8a2be17c787'},2014,October,2013-10-29T00:00:00Z,PEOPLE'S REPUBLIC OF CHINA,,People's Republic of China!$!CN,CN,People's Republic of China,China,...,LR,IBRD,Active,N,"{'Percent': 100, 'Name': 'Climate change'}","[{'code': '81', 'name': 'Climate change'}]",81.0,0,27280000,http://www.worldbank.org/projects/P127033/chin...
8,{'$oid': '52b213b38594d8a2be17c788'},2014,October,2013-10-29T00:00:00Z,THE GOVERNMENT OF INDIA,2018-12-31T00:00:00Z,Republic of India!$!IN,IN,Republic of India,India,...,TI,IBRD,Active,N,"{'Percent': 87, 'Name': 'Other rural developme...","[{'code': '79', 'name': 'Other rural developme...",79.0,160000000,160000000,http://www.worldbank.org/projects/P130164/raja...
9,{'$oid': '52b213b38594d8a2be17c789'},2014,October,2013-10-29T00:00:00Z,THE KINGDOM OF MOROCCO,2014-12-31T00:00:00Z,Kingdom of Morocco!$!MA,MA,Kingdom of Morocco,Morocco,...,"BM,BC,BZ",IBRD,Active,N,"{'Percent': 33, 'Name': 'Other accountability/...","[{'code': '29', 'name': 'Other accountability/...",273029.0,200000000,200000000,http://www.worldbank.org/projects/P130903?lang=en
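
`json.load` parses the file into Python objects, so `str_data` holds a list of dictionaries (one per project) that `pd.DataFrame` can consume directly. A small sketch of that round trip, using a temporary file with made-up records in place of 'world_bank_projects.json' (the file name `sample_projects.json` and the values are assumptions for illustration):

```python
import json
import os
import tempfile

import pandas as pd

# Made-up stand-in for the real file: a JSON array of project records
sample = [
    {"id": "P001", "countryshortname": "Ethiopia", "totalamt": 130000000},
    {"id": "P002", "countryshortname": "Tunisia", "totalamt": 0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_projects.json")
    with open(path, "w") as f:
        json.dump(sample, f)

    # The context manager closes the file once parsing is done
    with open(path) as f:
        loaded = json.load(f)

# The parsed result is a list of dicts, not a string
print(type(loaded).__name__, type(loaded[0]).__name__)

df_sample = pd.DataFrame(loaded)
print(df_sample.shape)
```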


****
## Begin by checking the number of rows and columns, the column names, and the data types
### The World Bank Projects JSON file has 500 rows and 50 columns. Most columns have object dtype; 6 columns hold integer data:

In [3]:
# Explore the pandas DataFrame with '.info()'
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null object
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 no

### Of the 500 World Bank projects in this file, the large majority (432) were approved in fiscal year 2013:

In [4]:
# Count projects by fiscal year
df1.approvalfy.value_counts().head()

2013    432
2014     66
2015      1
1999      1
Name: approvalfy, dtype: int64
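
To express these counts as shares of the whole, a short sketch using the four numbers printed above (typed in by hand here so the block runs without the data file):

```python
import pandas as pd

# Counts copied from the value_counts() output above
fy_counts = pd.Series({"2013": 432, "2014": 66, "2015": 1, "1999": 1})

# Divide by the total to get each fiscal year's share of projects
shares = fy_counts / fy_counts.sum()
print(fy_counts.sum())           # 500
print(round(shares["2013"], 3))  # 0.864
```

On the real column, `df1.approvalfy.value_counts(normalize=True)` should give the same proportions directly.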

## Q1: The 10 countries (or regional groupings, e.g., Africa) with the most World Bank projects have between 11 and 19 projects each.

In [5]:
# Count World Bank Projects by country short name
# Use .value_counts and .head(10) to count frequency and view top 10
df1.countryshortname.value_counts().head(10)

Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Morocco               12
Bangladesh            12
Nepal                 12
Mozambique            11
Africa                11
Name: countryshortname, dtype: int64
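
As a cross-check that does not depend on pandas, the same kind of tally can be sketched with `collections.Counter` on the raw records. The records below are made up; with the real data one would iterate over `str_data`. The set `regional_labels` is an assumption: deciding which labels are regional groupings rather than countries is a manual judgment call.

```python
from collections import Counter

# Made-up records standing in for the parsed JSON list
records = [
    {"countryshortname": "China"},
    {"countryshortname": "India"},
    {"countryshortname": "China"},
    {"countryshortname": "Africa"},
    {"countryshortname": "China"},
    {"countryshortname": "India"},
]

# Assumption: labels treated as regional groupings, not countries
regional_labels = {"Africa"}

# Tally every label, then drop the regional ones for a countries-only view
all_counts = Counter(r["countryshortname"] for r in records)
country_only = Counter(
    {name: n for name, n in all_counts.items() if name not in regional_labels}
)

print(all_counts.most_common(2))    # [('China', 3), ('India', 2)]
print(country_only.most_common(2))  # [('China', 3), ('India', 2)]
```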

### What is going on in the mjtheme_namecode field?

In [6]:
# Check the list of the first 15 project themes to see format
df1.mjtheme_namecode.head(15)

0     [{'code': '8', 'name': 'Human development'}, {...
1     [{'code': '1', 'name': 'Economic management'},...
2     [{'code': '5', 'name': 'Trade and integration'...
3     [{'code': '7', 'name': 'Social dev/gender/incl...
4     [{'code': '5', 'name': 'Trade and integration'...
5     [{'code': '6', 'name': 'Social protection and ...
6     [{'code': '2', 'name': 'Public sector governan...
7     [{'code': '11', 'name': 'Environment and natur...
8     [{'code': '10', 'name': 'Rural development'}, ...
9     [{'code': '2', 'name': 'Public sector governan...
10    [{'code': '10', 'name': 'Rural development'}, ...
11    [{'code': '10', 'name': 'Rural development'}, ...
12                          [{'code': '4', 'name': ''}]
13    [{'code': '5', 'name': 'Trade and integration'...
14    [{'code': '6', 'name': 'Social protection and ...
Name: mjtheme_namecode, dtype: object

In [7]:
# Use the parsed JSON (the list of dicts 'str_data' loaded above)
# Normalize it to extract World Bank project "themes" from 'mjtheme_namecode'
themes = json_normalize(str_data, 'mjtheme_namecode', ['id'])

# Set the index to 'code' and show the first 15 rows
themes = themes.set_index('code')
themes.head(15) # first 15 extracted themes, split into 'code', 'name', and project 'id'

code,name,id
8,Human development,P129828
11,,P129828
1,Economic management,P144674
6,Social protection and risk management,P144674
5,Trade and integration,P145310
2,Public sector governance,P145310
11,Environment and natural resources management,P145310
6,Social protection and risk management,P145310
7,Social dev/gender/inclusion,P144665
7,Social dev/gender/inclusion,P144665
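
The second and third arguments to `json_normalize` are `record_path` and `meta`. A minimal sketch of what they do, on two made-up projects (assuming pandas 1.0 or later for `pd.json_normalize`; the ids and the name `theme_rows` are hypothetical):

```python
import pandas as pd

# Made-up projects, each carrying a list of theme dicts like 'mjtheme_namecode'
projects = [
    {"id": "P001", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""},
    ]},
    {"id": "P002", "mjtheme_namecode": [
        {"code": "1", "name": "Economic management"},
    ]},
]

# record_path: the nested list to unpack, one output row per list element
# meta: parent-level fields repeated on every row that came from that parent
theme_rows = pd.json_normalize(projects, record_path="mjtheme_namecode", meta=["id"])
print(theme_rows)
```

A project with two themes contributes two rows, which is why the extracted table has more rows than there are projects.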


In [8]:
# How many distinct values of 'code' (now the index) are there, and how often does each occur?
themes.index.value_counts().head(15) # there are 11 codes for major project themes

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
3      15
Name: code, dtype: int64

## Q2: The top 10 major project themes and their frequency
### Note that 122 theme entries in the data set have a blank name

In [9]:
# Show the top 10 most frequent major project themes
themes.name.value_counts().head(10)

Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
Name: name, dtype: int64
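
The blank row appears in the counts because the missing names are empty strings rather than NaN, and `value_counts` treats an empty string as an ordinary value. A small sketch on made-up theme rows (the frame `themes_sample` is hypothetical):

```python
import pandas as pd

# Made-up theme rows; two of them have an empty-string name
themes_sample = pd.DataFrame({
    "code": ["8", "11", "1", "11", "4"],
    "name": ["Human development", "", "Economic management",
             "Environment and natural resources management", ""],
})

# Empty strings are not NaN, so isna() misses them; compare explicitly
n_blank = (themes_sample["name"] == "").sum()
n_nan = themes_sample["name"].isna().sum()
print(n_blank, n_nan)

# Masking blanks to NaN lets value_counts() skip them (it drops NaN by default)
named_counts = themes_sample["name"].mask(themes_sample["name"] == "").value_counts()
print(named_counts.sum())
```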

## Q3: For rows where there are data in the 'code' column but the 'name' column is empty, create a data frame with the missing names filled in.

In [19]:
# Begin steps to fill in missing names

# Normalize the parsed JSON: one row per theme entry, tagged with its project name
proj_name_theme = json_normalize(str_data, record_path='mjtheme_namecode', meta='project_name')
# record_path specifies the path to the nested list of records to unpack

# Build a lookup table for 'code' and 'name'
lookup = proj_name_theme[proj_name_theme.name != ""].groupby(['name','code']).size().reset_index(name='size')
del lookup['size']
lookup

,name,code
0,Economic management,1
1,Environment and natural resources management,11
2,Financial and private sector development,4
3,Human development,8
4,Public sector governance,2
5,Rule of law,3
6,Rural development,10
7,Social dev/gender/inclusion,7
8,Social protection and risk management,6
9,Trade and integration,5
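
An alternative way to build such a lookup, shown here only as a sketch and not as the method used in this notebook, is `drop_duplicates` on the non-blank rows, followed by a check that each code maps to a single name. The sample rows and the name `lookup_alt` are made up.

```python
import pandas as pd

# Made-up theme rows with one blank name
theme_rows = pd.DataFrame({
    "code": ["8", "11", "8", "11", "1"],
    "name": ["Human development", "",
             "Human development",
             "Environment and natural resources management",
             "Economic management"],
})

# Keep rows that have a name, then reduce to distinct (code, name) pairs
named = theme_rows[theme_rows["name"] != ""]
lookup_alt = named[["code", "name"]].drop_duplicates().reset_index(drop=True)

# Sanity check: every code should map to exactly one name
assert lookup_alt["code"].is_unique
print(lookup_alt)
```

The uniqueness check matters because a code with two different spellings of its name would otherwise duplicate rows in a later merge.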


In [20]:
# Use pd.merge to left-join the lookup table onto the full set of theme rows
replace_missing = pd.merge(proj_name_theme, lookup, how='left', on='code', suffixes=('', '_lookup'))
# Overwrite 'name' with the looked-up name, then drop the helper column
replace_missing.name = replace_missing.name_lookup
del replace_missing['name_lookup']
replace_missing.head(15)
# The first rows of replace_missing show the previously blank 'name' fields filled in from the lookup

,code,name,project_name
0,8,Human development,Ethiopia General Education Quality Improvement...
1,11,Environment and natural resources management,Ethiopia General Education Quality Improvement...
2,1,Economic management,TN: DTF Social Protection Reforms Support
3,6,Social protection and risk management,TN: DTF Social Protection Reforms Support
4,5,Trade and integration,Tuvalu Aviation Investment Project - Additiona...
5,2,Public sector governance,Tuvalu Aviation Investment Project - Additiona...
6,11,Environment and natural resources management,Tuvalu Aviation Investment Project - Additiona...
7,6,Social protection and risk management,Tuvalu Aviation Investment Project - Additiona...
8,7,Social dev/gender/inclusion,Gov't and Civil Society Organization Partnership
9,7,Social dev/gender/inclusion,Gov't and Civil Society Organization Partnership
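
The merge above overwrites every name with the looked-up one. A variant that keeps existing names and fills only the blank ones can be sketched with `Series.map` and `Series.where`. This is an illustrative alternative on made-up rows, not the approach used in the cell above; `code_to_name` and `filled` are hypothetical names.

```python
import pandas as pd

# Made-up theme rows; row 1 has a blank name
theme_rows = pd.DataFrame({
    "code": ["8", "11", "1", "11"],
    "name": ["Human development", "", "Economic management",
             "Environment and natural resources management"],
})

# Code-to-name mapping built from rows that do have a name
named = theme_rows[theme_rows["name"] != ""]
code_to_name = named.drop_duplicates("code").set_index("code")["name"]

# where() keeps values where the condition is True and takes the mapped
# name otherwise, so only the blank entries are changed
filled = theme_rows.copy()
filled["name"] = filled["name"].where(
    filled["name"] != "", filled["code"].map(code_to_name)
)
print(filled)
```

With either approach, a code that never appears with a name anywhere in the data would end up as NaN, so it is worth checking for remaining gaps afterwards.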


## The updated list of the top ten major project themes for World Bank projects:
### Note that the counts are higher than in the earlier list because the previously blank names are now filled in, so no empty name appears in the list.

In [21]:
# Rebuild the top 10 most frequent major project themes with the updated themes
replace_missing.name.value_counts().head(10) # note that these counts match the count of the index 'code' above

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64
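
As a quick arithmetic check, the difference between the two `value_counts()` outputs (before and after the fill) shows how many blank entries each theme absorbed. The numbers below are copied by hand from those outputs, restricted to the nine themes that appear in both lists.

```python
# Counts copied from the two value_counts() outputs in this notebook
before = {
    "Environment and natural resources management": 223,
    "Rural development": 202,
    "Human development": 197,
    "Public sector governance": 184,
    "Social protection and risk management": 158,
    "Financial and private sector development": 130,
    "Social dev/gender/inclusion": 119,
    "Trade and integration": 72,
    "Urban development": 47,
}
after = {
    "Environment and natural resources management": 250,
    "Rural development": 216,
    "Human development": 210,
    "Public sector governance": 199,
    "Social protection and risk management": 168,
    "Financial and private sector development": 146,
    "Social dev/gender/inclusion": 130,
    "Trade and integration": 77,
    "Urban development": 50,
}

# Blank entries assigned to each theme by the fill
filled_per_theme = {name: after[name] - before[name] for name in before}
total_filled = sum(filled_per_theme.values())

print(total_filled)        # 114
print(122 - total_filled)  # 8
```

These nine themes account for 114 of the 122 blank entries. The remaining 8 presumably belong to the two codes missing from this comparison (Economic management and Rule of law), whose pre-fill counts are not shown above.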