# JSON Data Wrangling with World Bank Projects list 
****

Using the JSON dataset of World Bank supported projects, we have the following goals:

1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

****
+ Funded WB Projects data source: data/world_bank_projects.json
****

## Import & Inspect World Bank Projects JSON
****


### Imports for Python

In [1]:
import pandas as pd
import json 
# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize
import numpy as np

### Load as DataFrame & Inspect JSON
****
Read the JSON file into a DataFrame and inspect a few sample rows.

In [2]:
# load as Pandas dataframe
project_json_df = pd.read_json('data/world_bank_projects.json')

# Set options to show ALL of the columns
pd.options.display.max_columns = 50

# show three sample rows (index 10 to 12)
project_json_df.head(13).tail(3)

Unnamed: 0,_id,approvalfy,board_approval_month,boardapprovaldate,borrower,closingdate,country_namecode,countrycode,countryname,countryshortname,docty,envassesmentcategorycode,grantamt,ibrdcommamt,id,idacommamt,impagency,lendinginstr,lendinginstrtype,lendprojectcost,majorsector_percent,mjsector_namecode,mjtheme,mjtheme_namecode,mjthemecode,prodline,prodlinetext,productlinetype,project_abstract,project_name,projectdocs,projectfinancialtype,projectstatusdisplay,regionname,sector,sector1,sector2,sector3,sector4,sector_namecode,sectorcode,source,status,supplementprojectflg,theme1,theme_namecode,themecode,totalamt,totalcommamt,url
10,{'$oid': '52b213b38594d8a2be17c78a'},2014,October,2013-10-25T00:00:00Z,GOVERNMENT OF SOUTH SUDAN,,Republic of South Sudan!$!SS,SS,Republic of South Sudan,South Sudan,"Project Paper,Project Information Document",B,7530000,0,P145339,0,"MINISTRY OF AGRICULTURE, COOPERATIVES AND RURA...",Specific Investment Loan,IN,7530000,"[{'Percent': 50, 'Name': 'Agriculture, fishing...","[{'code': 'AX', 'name': 'Agriculture, fishing,...",[Rural development],"[{'code': '10', 'name': 'Rural development'}, ...",102,RE,Recipient Executed Activities,L,{'cdata': 'The development objective of the Ad...,Southern Sudan Emergency Food Crisis Response ...,"[{'DocDate': '01-OCT-2013', 'EntityID': '00044...",OTHER,Active,Africa,"[{'Name': 'Crops'}, {'Name': 'Other social ser...","{'Percent': 50, 'Name': 'Crops'}","{'Percent': 30, 'Name': 'Other social services'}","{'Percent': 20, 'Name': 'General agriculture, ...",,"[{'code': 'AH', 'name': 'Crops'}, {'code': 'JB...","AZ,JB,AH",IBRD,Active,Y,"{'Percent': 100, 'Name': 'Global food crisis r...","[{'code': '91', 'name': 'Global food crisis re...",91.0,0,7530000,http://www.worldbank.org/projects/P145339?lang=en
11,{'$oid': '52b213b38594d8a2be17c78b'},2014,October,2013-10-25T00:00:00Z,,2017-12-31T00:00:00Z,Republic of India!$!IN,IN,Republic of India,India,"Project Appraisal Document,Environmental Asses...",B,0,0,P146653,250000000,,Investment Project Financing,IN,250000000,"[{'Percent': 60, 'Name': 'Transportation'}, {'...","[{'code': 'TX', 'name': 'Transportation'}, {'c...","[Rural development, Social protection and risk...","[{'code': '10', 'name': 'Rural development'}, ...",106611,PE,IBRD/IDA,L,{'cdata': 'The objective of the Uttarakhand Di...,Uttarakhand Disaster Recovery Project,"[{'DocDate': '11-OCT-2013', 'EntityID': '00033...",IDA,Active,South Asia,[{'Name': 'Rural and Inter-Urban Roads and Hig...,"{'Percent': 60, 'Name': 'Rural and Inter-Urban...","{'Percent': 25, 'Name': 'Flood protection'}","{'Percent': 10, 'Name': 'Housing construction'}","{'Percent': 5, 'Name': 'Other social services'}","[{'code': 'TI', 'name': 'Rural and Inter-Urban...","JB,YC,WD,TI",IBRD,Active,N,"{'Percent': 60, 'Name': 'Rural services and in...","[{'code': '78', 'name': 'Rural services and in...",81875278.0,250000000,250000000,http://www.worldbank.org/projects/P146653?lang=en
12,{'$oid': '52b213b38594d8a2be17c78c'},2014,October,2013-10-24T00:00:00Z,GOVERNMENT OF GHANA,2019-06-30T00:00:00Z,Republic of Ghana!$!GH,GH,Republic of Ghana,Ghana,"Project Appraisal Document,Integrated Safeguar...",C,0,0,P144140,97000000,MINISTRY OF COMMUNICATIONS,Specific Investment Loan,IN,97000000,"[{'Percent': 100, 'Name': 'Information and com...","[{'code': 'CX', 'name': 'Information and commu...",,"[{'code': '4', 'name': ''}]",4,PE,IBRD/IDA,L,{'cdata': 'The development objective of the e-...,GH eTransform Ghana,"[{'DocDate': '26-SEP-2013', 'EntityID': '00045...",IDA,Active,Africa,[{'Name': 'General information and communicati...,"{'Percent': 100, 'Name': 'General information ...",,,,"[{'code': 'CZ', 'name': 'General information a...",CZ,IBRD,Active,N,"{'Percent': 0, 'Name': ''}",,,97000000,97000000,http://www.worldbank.org/projects/P144140/gh-e...


### General Information on World Bank Project Dataset
****

In [3]:
project_json_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null int64
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 non

## Inspect: Top Ten Countries with Most Projects in World Bank Projects List
****
1. Find the 10 countries with most projects

View the top 20 entries in countryname. The info output above shows that countryname is populated in all 500 rows, with no NaN values.

In [4]:
# show the first 10 countries (1 to 10) with the most projects
project_json_df.countryname.value_counts(dropna=False).head(10)

Republic of Indonesia              19
People's Republic of China         19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
Kingdom of Morocco                 12
People's Republic of Bangladesh    12
Nepal                              12
Republic of Mozambique             11
Africa                             11
Name: countryname, dtype: int64

In [5]:
# show next 10 countries (11 to 20) with the most projects
project_json_df.countryname.value_counts(dropna=False).head(20).tail(10)

Burkina Faso                        9
Federative Republic of Brazil       9
Islamic Republic of Pakistan        9
Republic of Tajikistan              8
United Republic of Tanzania         8
Republic of Armenia                 8
Lao People's Democratic Republic    7
Kyrgyz Republic                     7
Federal Republic of Nigeria         7
Hashemite Kingdom of Jordan         7
Name: countryname, dtype: int64

The counts break cleanly between the 10th and 11th entries, but Africa is a continent, not a country, so it is removed from the country count. This leaves a three-way tie for tenth place between Burkina Faso, Pakistan, and Brazil.
A quick check shows that Burkina Faso is a landlocked country in West Africa. (https://en.wikipedia.org/wiki/Burkina_Faso  &  https://www.cia.gov/library/publications/the-world-factbook/geos/uv.html)
A double check on Nepal showed that Nepal is also an independent landlocked country. (https://en.wikipedia.org/wiki/Nepal & https://www.cia.gov/library/publications/the-world-factbook/geos/np.html)

### Double check countrycode on correlation to countryname.

In [6]:
project_json_df.countrycode.value_counts(dropna=False).head(10)

CN    19
ID    19
VN    17
IN    16
RY    13
MA    12
BD    12
NP    12
MZ    11
3A    11
Name: countrycode, dtype: int64

In [7]:
project_json_df.countrycode.value_counts(dropna=False).head(20).tail(10)

BR    9
BF    9
PK    9
TJ    8
TZ    8
AM    8
NG    7
KG    7
LA    7
JO    7
Name: countrycode, dtype: int64

The counts for the top 20 countrycode values match the counts for the top 20 countryname values.
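
Comparing two lists of counts by eye can miss a code that maps to more than one name. The sketch below shows one way to check the correspondence directly. It runs on a small toy DataFrame (`toy_df`, with illustrative rows) rather than on `project_json_df`, and the helper names `names_per_code` and `codes_with_conflicts` are hypothetical.

```python
import pandas as pd

# Toy stand-in for project_json_df (illustrative rows only)
toy_df = pd.DataFrame({
    'countrycode': ['CN', 'CN', 'ID', 'VN', 'VN'],
    'countryname': ["People's Republic of China", "People's Republic of China",
                    'Republic of Indonesia', 'Socialist Republic of Vietnam',
                    'Socialist Republic of Vietnam'],
})

# Count the distinct names seen for each code; a clean mapping gives 1 everywhere
names_per_code = toy_df.groupby('countrycode')['countryname'].nunique()

# Keep only the codes that map to more than one name
codes_with_conflicts = names_per_code[names_per_code > 1]

print(codes_with_conflicts.empty)
```

An empty `codes_with_conflicts` would indicate that every code maps to exactly one name. The same two lines could be run against `project_json_df` under the assumption that both columns are fully populated, as the info output suggests.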

***

## Final: Top Ten Countries with Most Projects in World Bank Projects List
****
There is a three-way tie for 10th place among the countries with the most projects.

In [8]:
# top 10 countries, skipping Africa and including the 3-way tie for 10th place
project_json_df.loc[project_json_df['countryname']!='Africa'].countryname.value_counts(dropna=False).head(12)

Republic of Indonesia              19
People's Republic of China         19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
People's Republic of Bangladesh    12
Nepal                              12
Kingdom of Morocco                 12
Republic of Mozambique             11
Burkina Faso                        9
Islamic Republic of Pakistan        9
Federative Republic of Brazil       9
Name: countryname, dtype: int64
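
The cell above uses `head(12)` because the tie was spotted by inspection. As a hedged alternative, `Series.nlargest` with `keep='all'` retains every entry tied with the last kept value, so the cutoff does not need to be hard-coded. The sketch below uses a toy Series with illustrative counts, not the project data.

```python
import pandas as pd

# Toy project counts (illustrative values only, not taken from the dataset)
counts = pd.Series({'A': 5, 'B': 4, 'C': 3, 'D': 3, 'E': 3, 'F': 1})

# Ask for the top 3; keep='all' also returns every entry tied with the 3rd value
top_with_ties = counts.nlargest(3, keep='all')

print(top_with_ties)
```

Here three entries share the third-highest count, so five entries come back instead of three. Applied to the filtered `value_counts()` result with `n=10`, the same call should return the 12 countries listed above, assuming the counts are as shown.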

****
**** 
## Top Ten Major Project Themes in World Bank Projects List

### Inspect: Top Major project Themes
****
+ Find the top 10 major project themes (using column 'mjtheme_namecode')
+ In item above, you will notice that some entries have only the code and the name is missing. Create a data frame with the missing names filled in

Major theme codes and names are nested within 'mjtheme_namecode'. To flatten the data, the JSON file will be reloaded with json.load, which returns a list of records. Note that the earlier info output showed that mjtheme_namecode is populated in all 500 rows.

In [9]:
# load the JSON records as a list of dicts (the name project_json_str is kept as-is)
with open('data/world_bank_projects.json') as json_file:
    project_json_str = json.load(json_file)
#project_json_str[1:3]

# Flatten the mjtheme_namecode column
theme_df = json_normalize(project_json_str, 'mjtheme_namecode')
theme_df.head(10)

Unnamed: 0,code,name
0,8,Human development
1,11,
2,1,Economic management
3,6,Social protection and risk management
4,5,Trade and integration
5,2,Public sector governance
6,11,Environment and natural resources management
7,6,Social protection and risk management
8,7,Social dev/gender/inclusion
9,7,Social dev/gender/inclusion


As expected, some name entries are blank. The major theme codes will be used to find the missing names.
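
To make the flattening step easier to follow, here is a minimal sketch on two toy records shaped like the project JSON. The `id` values `P1` and `P2` and the name `toy_theme_df` are hypothetical, and the sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`. The `meta` argument is an optional extra, not used in the notebook, that carries the parent project id along with each theme row.

```python
import pandas as pd

# Two toy records shaped like the World Bank JSON (ids are made up)
toy_records = [
    {'id': 'P1', 'mjtheme_namecode': [{'code': '8', 'name': 'Human development'},
                                      {'code': '11', 'name': ''}]},
    {'id': 'P2', 'mjtheme_namecode': [{'code': '1', 'name': 'Economic management'}]},
]

# record_path selects the nested list to expand into rows;
# meta copies the parent-level 'id' onto each expanded row
toy_theme_df = pd.json_normalize(toy_records,
                                 record_path='mjtheme_namecode',
                                 meta=['id'])

print(toy_theme_df)
```

Each dictionary in the nested list becomes one row, so a project with two themes contributes two rows. This is why the flattened table has more rows than the 500 projects.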

### Top WB Project Theme Codes
****
A quick look at the most frequent major theme codes should give quick insight into the top major theme names.

In [10]:
theme_df.code.value_counts(dropna=False)

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
3      15
Name: code, dtype: int64

In [11]:
theme_df.code.value_counts(dropna=False).count()

11

There are only 11 major theme codes in the JSON data, with major theme code 3 clearly in 11th place.

***
***
## Filling in the Missing Major Theme Names
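The flattened data set will be sorted by code and name, so that within each code the rows with an empty name come first. The empty theme names will then be replaced with NaN and back-filled from the next named row, which carries the same code.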

In [12]:
theme_df = theme_df.sort_values(['code','name'])
theme_df.head(9)

Unnamed: 0,code,name
212,1,
363,1,
1024,1,
1114,1,
1437,1,
2,1,Economic management
88,1,Economic management
175,1,Economic management
204,1,Economic management


In [13]:
# replace empty names with NaN, using .loc to avoid chained assignment
theme_df.loc[theme_df['name'] == '', 'name'] = np.nan
theme_df.head(9)

Unnamed: 0,code,name
212,1,
363,1,
1024,1,
1114,1,
1437,1,
2,1,Economic management
88,1,Economic management
175,1,Economic management
204,1,Economic management


In [14]:
theme_df = theme_df.bfill()
theme_df.head(9)

Unnamed: 0,code,name
212,1,Economic management
363,1,Economic management
1024,1,Economic management
1114,1,Economic management
1437,1,Economic management
2,1,Economic management
88,1,Economic management
175,1,Economic management
204,1,Economic management
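
Back-filling depends on the sort order, and it would borrow a name from the next code if some code had no named rows at all. As a hedged alternative, the names can be filled from an explicit code-to-name lookup. The sketch below uses a toy table with illustrative rows, and the names `known` and `code_to_name` are hypothetical.

```python
import pandas as pd

# Toy theme table with blank names (illustrative rows only)
toy_theme_df = pd.DataFrame({
    'code': ['1', '11', '1', '11'],
    'name': ['Economic management', '', '',
             'Environment and natural resources management'],
})

# Build a code -> name lookup from the rows that do have a name
known = toy_theme_df[toy_theme_df['name'] != ''].drop_duplicates('code')
code_to_name = known.set_index('code')['name']

# Rewrite every name from the lookup; a code with no known name would become NaN
toy_theme_df['name'] = toy_theme_df['code'].map(code_to_name)

print(toy_theme_df)
```

This version does not need the table to be sorted, and a code that never appears with a name shows up as NaN instead of silently taking a neighbouring code's name.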


## Final: Top Ten Major Project Themes in World Bank Projects List

In [15]:
theme_df.name.value_counts(dropna=False).head(10)

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64

****
****
# Summary
***
## Loaded and Started Wrangling the JSON Data Set
+ The World Bank Project JSON dataset was loaded into a DataFrame and inspected.
+ All 50 columns were viewed for a few sample rows. info() was used to view the structure and discover NaN values.

## Top 10 Countries with Most WB Projects
+ Found the Top 10 Countries with the most WB projects. 
+ There was a **three way tie** for 10th place

Countries | Project Count
:--- | ---:
Republic of Indonesia            |  19
People's Republic of China       |  19
Socialist Republic of Vietnam    |  17
Republic of India                |  16
Republic of Yemen                |  13
Nepal                            |  12
People's Republic of Bangladesh  |  12
Kingdom of Morocco               |  12
Republic of Mozambique           |  11
Islamic Republic of Pakistan     |   9
Federative Republic of Brazil    |   9
Burkina Faso                     |   9

## Flattened Sublist and Continued Data Wrangling
+ Read in the WB Project JSON as a list of records and flattened the sublist containing the major theme codes and names.
+ Cleaned up missing major theme names by back filling a sorted list.

## Top 10 Major Project Themes
+ Found the Top 10 Major Project Themes.

Theme Names | Theme Count
:--- | ---:
Environment and natural resources management  |  250
Rural development                             |  216
Human development                             |  210
Public sector governance                      |  199
Social protection and risk management         |  168
Financial and private sector development      |  146
Social dev/gender/inclusion                   |  130
Trade and integration                         |   77
Urban development                             |   50
Economic management                           |   38
