## Storing the countries in Neo4j

We have a master list of ISO countries that we can use to create our base country nodes.

In [1]:
import json
import pandas as pd

In [2]:
iso_data = pd.read_json('./data/iso3166_country_codes.json')

In [3]:
iso_data.head()

Unnamed: 0,alt_name,iso3166_code,name
0,Afghanistan,AF,Afghanistan
1,Aland,AX,Aland Islands
2,Albanie,AL,Albania
3,Algerie,DZ,Algeria
4,Samoa americaines,AS,American Samoa


### Now we can loop over the data and insert each country with its corresponding code and name

In [4]:
from neo4j.v1 import GraphDatabase
driver = GraphDatabase.driver("bolt://10.0.0.1:7687", auth=("myusername", "mypassword"))

First we build a list of the nodes we wish to create, along with their properties. Then we connect to the Neo4j database and run a Cypher command to create them; we use MERGE rather than CREATE so that duplicate entries in the list do not produce duplicate nodes.

In [5]:
countryList = [{'code': row.iso3166_code, 'name': row['name'].strip().upper()} for ind, row in iso_data.iterrows()]

In [49]:
with driver.session() as session:
    session.run(("UNWIND {list} AS d "
                 "MERGE (c:Country {code: d.code, name: d.name})"),
                {"list": countryList})
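
One caveat worth illustrating: MERGE matches on the whole pattern, so two rows that share a code but differ in name would still produce two nodes. A minimal sketch of de-duplicating by code before sending the list is shown here. The helper name `build_country_rows` and the sample pairs are assumptions made for illustration, not part of the original notebook.

```python
def build_country_rows(pairs):
    """Return one {'code', 'name'} dict per distinct country code.

    `pairs` is any iterable of (code, name) tuples. The first name seen
    for a code wins; later duplicates of that code are dropped.
    """
    rows = {}
    for code, name in pairs:
        code = str(code).strip().upper()
        name = str(name).strip().upper()
        # setdefault keeps the first entry recorded for each code.
        rows.setdefault(code, {'code': code, 'name': name})
    return list(rows.values())


# Hypothetical sample, including a repeated code with different casing.
sample = [('AF', 'Afghanistan'), ('AX', 'Aland Islands '), ('af', 'AFGHANISTAN')]
country_rows = build_country_rows(sample)
print(country_rows)
```

With the real data, the input could be something like `zip(iso_data['iso3166_code'], iso_data['name'])`, and the result passed as the `list` parameter in place of `countryList`.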

## Combining the ISO country codes with secrecy data
We want to incorporate financial secrecy data into our modelling of the network. We used an edited version of the financial secrecy classification of countries, but a similar data set is available to the public as an Excel file here:
- Financial Secrecy Index 2018 Results; https://www.financialsecrecyindex.com/introduction/fsi-2018-results

Be warned: the data will need a little cleaning up, after which it can be used in a similar fashion to that shown below.

In [6]:
secrecy_csv = './data/financialSecrecy.csv'

In [25]:
code2name = {row['code']: row['name'] for row in countryList}

### Country & Nationality mapping

In many instances we will need to be able to convert between country and nationality when connecting people and companies to countries.

We have three different source files that we can use. They are natively stored as JSON, and we will convert them to pickles for easy use within Python. The three files are:

- Country names to country code mappings
- Country code to country name mappings
- Nationality to country code mappings

In [7]:
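def convert_score(datum):
    """Convert a secrecy score string to a float.

    A plain value such as '78.91' is converted directly; a bracketed range
    such as '(76-84)' is replaced by the midpoint of its two bounds.
    """
    components = datum.split('-')
    if len(components) == 1:
        return float(components[0])
    # Strip the opening and closing brackets before averaging the bounds.
    return (float(components[0][1:]) + float(components[1][:-1])) / 2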
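
To make the behaviour of `convert_score` concrete, the short sketch below repeats the function so that it runs on its own and applies it to three score strings of the kinds that appear in the secrecy data: two bracketed ranges and one plain value.

```python
def convert_score(datum):
    # Split a range such as '(76-84)' into its lower and upper parts.
    components = datum.split('-')
    if len(components) == 1:
        # No hyphen: the score is already a single number.
        return float(components[0])
    # Drop '(' from the lower bound and ')' from the upper bound,
    # then take the midpoint of the range.
    return (float(components[0][1:]) + float(components[1][:-1])) / 2


samples = ['(76-84)', '(75-83)', '78.91']
scores = [convert_score(s) for s in samples]
print(scores)  # [80.0, 79.0, 78.91]
```

Because the split is on every hyphen, a negative value or a range written without brackets would not be handled by this function.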

In [8]:
secrecy_df = pd.read_csv(secrecy_csv)
secrecy_df['final_secrecy_score'] = secrecy_df['Secrecy Score'].map(convert_score)

In [9]:
secrecy_df.head()

Unnamed: 0,RANK,Jurisdiction,(Formula) FSI Value,Secrecy Score,Global Scale Weight,Unnamed: 5,final_secrecy_score
0,NA7,Maldives,-,(76-84),0.0,,80.0
1,NA7,Paraguay,-,(75-83),0.001,,79.0
2,NA7,Gambia,-,(73-81),0.0,,77.0
3,NA7,Tanzania,-,(73-81),0.006,,77.0
4,NA7,Bolivia,-,(72-80),0.001,,76.0


In [70]:
secrecy_df['COUNTRY_CODE'] = secrecy_df.Jurisdiction.map(lambda x: country_names_2_codes.get(x.strip().upper(), 'missing'))

In [71]:
missing_regions = secrecy_df[secrecy_df.COUNTRY_CODE == 'missing'].Jurisdiction.values

In [75]:
missing_regions

array(['Samoa', 'Marshall Islands', 'Malaysia (Labuan)', 'Monaco',
       'Turks & Caicos Islands', 'US Virgin Islands', 'Anguilla',
       'Curacao', 'Montserrat', 'Korea', 'Portugal (Madeira)'], dtype=object)

In [61]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [63]:
for region in missing_regions:
    print(region, process.extract(region, country_names_2_codes.keys(), limit=3))

Samoa [('AMERICAN SAMOA', 90), ('CAMBODIA', 62), ('ZAMBIA', 55)]
St Lucia [('ST. LUCIA', 95), ('ST. VINCENT & GRENADINES', 86), ('ST. KITTS & NEVIS', 86)]
Brunei Darussalam [('BRUNEI', 90), ('RUSSIA', 75), ('BURUNDI', 64)]
Marshall Islands [('CHANNEL ISLANDS', 71), ('COOK ISLANDS', 70), ('ÌÉLAND ISLANDS', 70)]
St Vincent & the Grenadines [('ST. VINCENT & GRENADINES', 95), ('ST. KITTS & NEVIS', 86), ('ST. LUCIA', 86)]
St Kitts & Nevis [('ST. KITTS & NEVIS', 97), ('ST. VINCENT & GRENADINES', 86), ('ST. LUCIA', 86)]
United Arab Emirates (Dubai) [('UNITED ARAB EMIRATES', 95), ('UNITED KINGDOM', 86), ('UNITED STATES', 86)]
Malaysia (Labuan) [('MALAYSIA', 90), ('MALTA', 72), ('ALBANIA', 64)]
Monaco [('MOROCCO', 62), ('MACEDONIA', 60), ('NAN', 60)]
Turks & Caicos Islands [('CAYMAN ISLANDS', 86), ('COOK ISLANDS', 86), ('ÌÉLAND ISLANDS', 86)]
Macao [('MACAU', 80), ('JAMAICA', 67), ('MALTA', 60)]
US Virgin Islands [('BRITISH VIRGIN ISLANDS', 86), ('COOK ISLANDS', 70), ('ÌÉLAND ISLANDS', 70)]
Ang
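
For readers without `fuzzywuzzy` installed, a rough equivalent of the ranking step can be sketched with the standard library. This swaps in `difflib.SequenceMatcher` as a stand-in scorer, so the numbers will not match the fuzzywuzzy scores printed above. The helper name `top_matches` and the short candidate list are assumptions for illustration.

```python
from difflib import SequenceMatcher


def top_matches(query, candidates, limit=3):
    """Rank candidate names by similarity to `query` on a 0-100 scale."""
    q = query.strip().upper()
    scored = [
        (name, round(100 * SequenceMatcher(None, q, name).ratio()))
        for name in candidates
    ]
    # Highest similarity first; keep only the best `limit` candidates.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical subset of the known country names.
known_names = ['ST. LUCIA', 'ST. KITTS & NEVIS', 'MACAU', 'BRUNEI']
for region in ['St Lucia', 'Macao']:
    print(region, top_matches(region, known_names))
```

As with the fuzzywuzzy output, the top candidate still needs a manual check: a high score only indicates similar spelling, not the same jurisdiction.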

Several of these are clear matches that our name map is missing and that need to be added:

- St Lucia == ST. LUCIA
- Brunei Darussalam == BRUNEI
- St Vincent & the Grenadines == ST. VINCENT & GRENADINES
- St Kitts & Nevis == ST. KITTS & NEVIS
- United Arab Emirates (Dubai) == UNITED ARAB EMIRATES
- Macao == MACAU
- Czech Republic == CZECHIA

In [67]:
extra_names = ['St Lucia', 'Brunei Darussalam', 'St Vincent & the Grenadines', 'St Kitts & Nevis', 'United Arab Emirates (Dubai)', 'Macao', 'Czech Republic']
extra_names = [name.upper() for name in extra_names]
name4code = ['ST. LUCIA', 'BRUNEI', 'ST. VINCENT & GRENADINES', 'ST. KITTS & NEVIS', 'UNITED ARAB EMIRATES', 'MACAU', 'CZECHIA']

extra_names2codes = {vals[0]: country_names_2_codes.get(vals[1]) for vals in zip(extra_names, name4code)}

In [68]:
country_names_2_codes.update(extra_names2codes)
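
The alias step can also be written with an explicit alias-to-canonical dictionary, which keeps each pair on one line and makes it easy to spot an alias whose canonical name is absent from the map (with `.get`, such an alias would otherwise be stored with a code of `None`). This is a minimal sketch: the helper `add_aliases` and the miniature name map are assumptions for illustration.

```python
def add_aliases(name_to_code, aliases):
    """Copy the code of each canonical name onto its upper-cased alias.

    Returns the aliases whose canonical name is not in the map, so they
    can be reviewed instead of being stored with a code of None.
    """
    unresolved = []
    for alias, canonical in aliases.items():
        code = name_to_code.get(canonical)
        if code is None:
            unresolved.append(alias)
        else:
            name_to_code[alias.strip().upper()] = code
    return unresolved


# Hypothetical subset of the name map; 'CZECHIA' is left out on purpose.
name_map = {'ST. LUCIA': 'LC', 'BRUNEI': 'BN', 'MACAU': 'MO'}
aliases = {
    'St Lucia': 'ST. LUCIA',
    'Brunei Darussalam': 'BRUNEI',
    'Macao': 'MACAU',
    'Czech Republic': 'CZECHIA',
}
unresolved = add_aliases(name_map, aliases)
print(unresolved)  # ['Czech Republic']
```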

In [69]:
pd.to_pickle(country_names_2_codes, "data/combined_country_map.pkl")

If we then re-run the mapping above with the extended name map, we get a new dataframe of secrecy scores that we can use to enrich the country nodes.

In [77]:
filtered_df = secrecy_df[secrecy_df.COUNTRY_CODE != 'missing']

In [79]:
with driver.session() as session:
    for i, row in filtered_df.iterrows():
        print(row['COUNTRY_CODE'], row['final_secrecy_score'])
        result = session.run("MATCH (c:Country {code: {code}}) SET c.secrecy_score = {score}", 
                             code=row['COUNTRY_CODE'], 
                             score=row['final_secrecy_score'])

MV 80.0
PY 79.0
GM 77.0
TZ 77.0
BO 76.0
TW 71.0
DO 69.0
VE 68.0
ME 64.0
VU 87.0
LR 83.0
LC 83.0
BN 83.0
AG 81.0
LB 79.0
BS 79.0
BZ 79.0
NR 78.91
BB 78.0
VC 78.0
KN 78.0
AE 77.0
AD 77.0
LI 76.0
GT 76.0
GD 76.0
DM 76.0
CK 76.0
BH 74.0
CH 73.0
HK 72.0
PA 72.0
MU 72.0
UY 71.0
BW 71.0
SC 71.0
MO 70.0
SM 70.0
SG 69.0
AW 68.0
GH 67.0
GI 67.0
BM 66.0
MK 66.0
KY 65.0
JE 65.0
GG 64.0
TR 64.0
IM 64.0
PH 63.0
SA 61.0
US 60.0
VG 60.0
JP 58.0
DE 56.0
LU 55.0
CR 55.0
CN 54.0
AT 54.0
RU 54.0
CL 54.0
IL 53.0
BR 52.0
MT 50.0
CY 50.0
SK 50.0
NL 48.0
CA 46.0
NZ 46.0
IS 46.0
MX 45.0
LV 45.0
EE 44.0
FR 43.0
AU 43.0
ZA 42.0
GB 41.0
BE 41.0
IE 40.0
IN 39.0
NO 38.0
SE 36.0
PL 36.0
HU 36.0
GR 36.0
IT 35.0
CZ 35.0
SI 34.0
ES 33.0
DK 31.0
FI 31.0
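
The loop above issues one query per country. A possible alternative, sketched here under assumptions, is to send all scores in a single query with UNWIND, as was done when creating the nodes. The helper `build_score_rows` and the constant `UPDATE_QUERY` are hypothetical names; the query keeps the `{param}` placeholder style used elsewhere in this notebook, whereas Neo4j 4.0 and later expect `$rows` instead.

```python
import math


def build_score_rows(records):
    """Turn (code, score) pairs into a list of dicts for UNWIND.

    Pairs with the 'missing' placeholder code or without a usable
    score are skipped.
    """
    rows = []
    for code, score in records:
        if code == 'missing' or score is None or math.isnan(score):
            continue
        rows.append({'code': code, 'score': float(score)})
    return rows


# One query for the whole batch; placeholder syntax mirrors this notebook.
UPDATE_QUERY = (
    "UNWIND {rows} AS r "
    "MATCH (c:Country {code: r.code}) "
    "SET c.secrecy_score = r.score"
)

# Hypothetical sample of (COUNTRY_CODE, final_secrecy_score) pairs.
sample = [('MV', 80.0), ('missing', 71.0), ('NR', 78.91), ('PY', float('nan'))]
score_rows = build_score_rows(sample)
print(score_rows)
```

With the real dataframe, the pairs could come from `zip(filtered_df['COUNTRY_CODE'], filtered_df['final_secrecy_score'])`, and the call would look like `session.run(UPDATE_QUERY, {"rows": score_rows})`.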


### Country codes to names

In [12]:
with open('data/clean_country_code_map.json', 'r') as f:
    country_codes_2_names = json.load(f)

We created a set of clean country codes from our own data. Do they match the ISO data we have?

In [26]:
for key in country_codes_2_names.keys():
    result = code2name.get(key, '**')
    if result == '**':
        print(key, country_codes_2_names.get(key))

DQ DQ
XK KOSOVO


It looks like we are missing Kosovo, which has code XK. DQ seems to be a mistake: it refers to Dominica, whose real code is DM. So let's add Kosovo and then save the output to a pickle.
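
The comparison between the two code maps can also be expressed as a set difference over their keys. The sketch below uses miniature, hypothetical versions of the two dictionaries; `codes_missing_from` is an illustrative helper name.

```python
def codes_missing_from(reference, candidate):
    """Return the codes present in `candidate` but absent from `reference`."""
    return sorted(set(candidate) - set(reference))


# Hypothetical miniature versions of the ISO map and the cleaned map.
iso_codes = {'DM': 'DOMINICA', 'AF': 'AFGHANISTAN'}
clean_codes = {'AF': 'AFGHANISTAN', 'DQ': 'DQ', 'XK': 'KOSOVO'}
missing = codes_missing_from(iso_codes, clean_codes)
print(missing)  # ['DQ', 'XK']
```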

In [29]:
code2name['XK'] = 'KOSOVO'

In [30]:
pd.to_pickle(code2name, "data/clean_country_code_map.pkl")

Let's also fix the DQ in the country_codes_2_names file ...

In [32]:
country_codes_2_names.pop('DQ', None)
country_codes_2_names['DM'] = code2name.get('DM')
print(country_codes_2_names['DM'])

DOMINICA


### Country names to codes

We also need a reverse lookup. Since a country can be referred to by a variety of names, let's map every name we have to its respective code.

In [36]:
with open("data/country_name_2_code_map.json", "r") as f:
    country_names_2_codes = json.load(f)

In [40]:
country_names_2_codes = {str(k).upper(): str(v).upper() for k, v in country_names_2_codes.items()}
country_names_2_codes.get('DQ')
country_names_2_codes['DQ'] = 'DM'

In [41]:
print(len(country_codes_2_names))
country_codes_2_names.update({v: k for k,v in code2name.items()})
print(len(country_codes_2_names))
country_codes_2_names.update({v: k for k,v in country_codes_2_names.items()})
print(len(country_codes_2_names))

195
442
519
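
When inverting a code-to-name map into a name-to-code lookup, two codes can share a name, and a plain dictionary comprehension would silently keep only the last one. A minimal sketch of an inversion that reports such collisions follows; `invert_code_map` and the sample data are assumptions for illustration rather than the notebook's own method.

```python
def invert_code_map(code_to_name):
    """Build a name -> code lookup and report names claimed by several codes."""
    name_to_code = {}
    collisions = {}
    for code, name in code_to_name.items():
        key = str(name).strip().upper()
        if key in name_to_code and name_to_code[key] != code:
            # Keep the first code, but record every code that claims this name.
            collisions.setdefault(key, {name_to_code[key]}).add(code)
        else:
            name_to_code[key] = code
    return name_to_code, collisions


# Hypothetical sample in which DM and DQ both claim Dominica.
sample_codes = {'DM': 'Dominica', 'XK': 'KOSOVO', 'DQ': 'dominica'}
names, clashes = invert_code_map(sample_codes)
print(names)
print(clashes)
```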


In [42]:
pd.to_pickle(country_names_2_codes, "data/combined_country_map.pkl")

### Nationality

In [45]:
with open("data/nationality_map.json", "r") as f:
    nationality_2_codes = json.load(f)

In [50]:
nationality_2_codes.pop('(blank)', None)
nationality_2_codes['domincan'] = 'DM'

In [51]:
pd.to_pickle(nationality_2_codes, "data/nation_map.pkl")
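
Later notebooks can reload these maps with `pd.read_pickle`, the counterpart of `pd.to_pickle`. The sketch below shows the same save-and-reload round trip using only the standard `pickle` module, so that it runs without pandas. The helper names, the temporary path, and the sample entries are assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_map(mapping, path):
    """Write a lookup dictionary to `path` as a pickle."""
    with open(path, 'wb') as handle:
        pickle.dump(mapping, handle)


def load_map(path):
    """Read a lookup dictionary back from a pickle file."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Hypothetical nationality map, cleaned the same way as above.
sample_map = {'(blank)': None, 'kosovan': 'XK'}
sample_map.pop('(blank)', None)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nation_map.pkl')
    save_map(sample_map, path)
    restored = load_map(path)

print(restored == sample_map)  # True
```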