# First Exercise: Learning How to Import JSON Files

*The goal of this exercise is to learn how to manipulate data stored in JSON files. I was given the world_bank_projects file to work with. Let's see if we can find anything interesting in this dataset, shall we? <br>
<br>
I import json, pandas, and numpy for this exercise, along with json_normalize from pandas.io.json.*

In [124]:
from pandas.io.json import json_normalize
import json
import pandas as pd
import numpy as np

*We are ready to read the JSON file in as json_df. Here I use pd.read_json because it returns a pandas DataFrame directly. I then call the .info() method to check the type and non-null count of each column, since there are a lot of them!*

In [125]:
json_df = pd.read_json('data/world_bank_projects.json')
json_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 50 columns):
_id                         500 non-null object
approvalfy                  500 non-null int64
board_approval_month        500 non-null object
boardapprovaldate           500 non-null object
borrower                    485 non-null object
closingdate                 370 non-null object
country_namecode            500 non-null object
countrycode                 500 non-null object
countryname                 500 non-null object
countryshortname            500 non-null object
docty                       446 non-null object
envassesmentcategorycode    430 non-null object
grantamt                    500 non-null int64
ibrdcommamt                 500 non-null int64
id                          500 non-null object
idacommamt                  500 non-null int64
impagency                   472 non-null object
lendinginstr                495 non-null object
lendinginstrtype            495 non

*Looks good! Let's see which ten countries have the most projects. First, I call value_counts() on the countryshortname column. I use this method rather than Counter from collections because it returns a pandas Series, and I want to keep working with pandas objects.*

In [126]:
project_count = json_df['countryshortname'].value_counts()
print(project_count.head(n=10))

China                 19
Indonesia             19
Vietnam               17
India                 16
Yemen, Republic of    13
Morocco               12
Bangladesh            12
Nepal                 12
Africa                11
Mozambique            11
Name: countryshortname, dtype: int64
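
*To make the value_counts() versus Counter comparison concrete, here is a minimal sketch on a hypothetical toy column (the six country values are made up for illustration and are not taken from the dataset). It only shows the return types: value_counts() gives back a pandas Series sorted by count, while Counter gives back a dict-like object.*

```python
from collections import Counter

import pandas as pd

# Hypothetical toy column standing in for json_df['countryshortname'].
countries = pd.Series(["China", "India", "China", "Nepal", "India", "China"])

# value_counts() returns a pandas Series indexed by the distinct values,
# sorted by count in descending order.
counts = countries.value_counts()

# Counter returns a dict subclass; most_common() yields (value, count) tuples.
tally = Counter(countries)

print(type(counts).__name__)
print(counts.head(2))
print(tally.most_common(2))
```

*Because the result is a Series, methods such as .head() and label-based indexing keep working on it, which is the reason given above for preferring it here.*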


*Hmm, interesting. Africa is not a country, so I should explore that. <br>
<br>
I started by showing the 20 countries with the most projects. The table now includes individual African countries such as Tanzania, which has 8 projects. After listing the borrowers for the projects whose country name is Africa, I could see that Tanzania accounts for two more projects. Reassigning them leaves Africa with two fewer projects (9) and lifts Tanzania to 10, so Africa drops below Tanzania.*

In [127]:
print(project_count.head(n=20))
print(json_df.borrower[json_df['countryname'] == 'Africa'])

China                               19
Indonesia                           19
Vietnam                             17
India                               16
Yemen, Republic of                  13
Morocco                             12
Bangladesh                          12
Nepal                               12
Africa                              11
Mozambique                          11
Brazil                               9
Burkina Faso                         9
Pakistan                             9
Armenia                              8
Tajikistan                           8
Tanzania                             8
Kyrgyz Republic                      7
Nigeria                              7
Lao People's Democratic Republic     7
Jordan                               7
Name: countryshortname, dtype: int64
45                            ECOWAS
46                    UGANDA-COMOROS
51                  OSS, IUCN, CILSS
58                     BANK EXECUTED
65           BURUNDI,RWANDA,TANZANI
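
*The adjustment described above is simple arithmetic, and it can be sketched as follows. The three counts are copied from the value_counts() output; the figure of two reassigned projects is the one stated in the text, not something this snippet derives from the data.*

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Africa": 11, "Mozambique": 11, "Tanzania": 8})

# Assumption taken from the text: two "Africa" projects belong to Tanzania.
reassigned = 2
adjusted = counts.copy()
adjusted["Africa"] -= reassigned
adjusted["Tanzania"] += reassigned

# Re-rank after the adjustment.
print(adjusted.sort_values(ascending=False))
```

*With these numbers Tanzania ends at 10 and Africa at 9, so Tanzania moves ahead of Africa in the ranking.*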

## Moving on to the top ten major project themes

*I initially tried the same method as in the previous section. However, I quickly found that it does not give a nice, clean result like the first one, because each entry in mjtheme_namecode is a list of code/name dictionaries rather than a single value.*


In [128]:
themes_count = json_df['mjtheme_namecode'].value_counts()
print(themes_count.head(n=10))

[{u'code': u'11', u'name': u'Environment and natural resources management'}, {u'code': u'11', u'name': u'Environment and natural resources management'}]                                                                                            12
[{u'code': u'8', u'name': u'Human development'}, {u'code': u'11', u'name': u''}]                                                                                                                                                                    11
[{u'code': u'8', u'name': u'Human development'}, {u'code': u'8', u'name': u'Human development'}]                                                                                                                                                     8
[{u'code': u'4', u'name': u'Financial and private sector development'}, {u'code': u'4', u'name': u'Financial and private sector development'}]                                                                                                       6
[{u'code': u

*Thus, I decided on a different method for this section. I used json.load and json_normalize because together they load the raw records and flatten the nested mjtheme_namecode lists into a table with one row per theme.*

In [129]:
# Load the raw records, then flatten each project's theme list into rows.
with open('data/world_bank_projects.json') as f:
    projects = json.load(f)
json_df = json_normalize(projects, 'mjtheme_namecode')

In [130]:
print(json_df.head(10))

  code                                          name
0    8                             Human development
1   11                                              
2    1                           Economic management
3    6         Social protection and risk management
4    5                         Trade and integration
5    2                      Public sector governance
6   11  Environment and natural resources management
7    6         Social protection and risk management
8    7                   Social dev/gender/inclusion
9    7                   Social dev/gender/inclusion
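
*To show what json_normalize does with a record path, here is a small sketch on two hypothetical records shaped like the project entries (the ids "P1" and "P2" are invented for illustration). It assumes pandas 1.0 or later, where the function is available as pd.json_normalize; older versions expose it as pandas.io.json.json_normalize, as imported at the top of this notebook.*

```python
import pandas as pd

# Hypothetical records: each holds a list of {code, name} dicts
# under 'mjtheme_namecode', like the real project entries.
records = [
    {"id": "P1", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""},
    ]},
    {"id": "P2", "mjtheme_namecode": [
        {"code": "1", "name": "Economic management"},
    ]},
]

# record_path names the nested list to unpack: one output row per inner dict.
themes = pd.json_normalize(records, record_path="mjtheme_namecode")
print(themes)
```

*Two projects with three themes in total become three rows, which is why the normalized table can have more rows than there are projects.*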


*When I printed the first ten rows, I realized that some entries in the name column were empty, so they needed to be filled in. I started by sorting on code and name, because rows with the same code share the same theme name.*

In [131]:
json_df = json_df.sort_values(['code', 'name'])
print(json_df.head(n=10))

     code                 name
212     1                     
363     1                     
1024    1                     
1114    1                     
1437    1                     
2       1  Economic management
88      1  Economic management
175     1  Economic management
204     1  Economic management
205     1  Economic management


*I then replaced every empty string in the name column with NaN. Lastly, I back-filled these NaN values, so each one takes the next available name, which after sorting belongs to the same code. This gives the following clean table.*

In [132]:
# Mark empty names as missing, then back-fill from the next row (same code after sorting).
json_df.loc[json_df['name'] == '', 'name'] = np.nan
json_df = json_df.bfill()
print(json_df.head(10))

     code                 name
212     1  Economic management
363     1  Economic management
1024    1  Economic management
1114    1  Economic management
1437    1  Economic management
2       1  Economic management
88      1  Economic management
175     1  Economic management
204     1  Economic management
205     1  Economic management
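
*The sort-and-back-fill approach relies on the rows being in the right order. As a hedged alternative (not the method used in this notebook), the blanks can also be filled from an explicit code-to-name lookup, which does not depend on sorting. The sketch below uses a hypothetical five-row frame with the same two columns.*

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the normalized themes table.
themes = pd.DataFrame({
    "code": ["8", "11", "1", "11", "1"],
    "name": [
        "Human development",
        "",
        "Economic management",
        "Environment and natural resources management",
        "",
    ],
})

# Build a code -> name lookup from the rows whose name is present.
named = themes[themes["name"] != ""]
lookup = named.drop_duplicates("code").set_index("code")["name"]

# Fill each blank name with the name recorded for the same code.
blank = themes["name"] == ""
themes.loc[blank, "name"] = themes.loc[blank, "code"].map(lookup)
print(themes)
```

*This assumes every code appears at least once with a non-empty name; a code that never does would be left as NaN after the mapping.*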


In [137]:
json_df.name.value_counts().head(n=10)

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64

### Conclusion


*The tables below list the top ten countries by number of projects and the top ten project themes.*



| Rank | Country            |
| ---- | ------------------ |
| 1    | China              |
| 2    | Indonesia          |
| 3    | Vietnam            |
| 4    | India              |
| 5    | Yemen, Republic of |
| 6    | Morocco            |
| 7    | Bangladesh         |
| 8    | Nepal              |
| 9    | Mozambique         |
| 10   | Tanzania           |


| Rank | Project Theme                                |
| ---- | -------------------------------------------- |
| 1    | Environment and natural resources management |
| 2    | Rural development                            |
| 3    | Human development                            |
| 4    | Public sector governance                     |
| 5    | Social protection and risk management        |
| 6    | Financial and private sector development     |
| 7    | Social dev/gender/inclusion                  |
| 8    | Trade and integration                        |
| 9    | Urban development                            |
| 10   | Economic management                          |
