# General Dutch Survey

This notebook searches the Offshore Leaks database for Dutch references.

## Preliminaries

Please note the general set-up requirements contained in the repo readme and the _showcase_ notebook.
Remember that the database needs to be running locally for this notebook to work.

In [None]:
#imports
import os  #to find the settings file(s)
import csv #to process the settings file(s)
import shutil #to copy the settings file (if needed)
from neo4j import GraphDatabase
import pandas as pd

In [None]:
#get settings
settings_dir = os.path.join("..","settings")
personal_settings = os.path.join(settings_dir,"personal_settings.csv")
if "personal_settings.csv" not in os.listdir(settings_dir):
    default_settings = os.path.join(settings_dir,"default_settings.csv")
    shutil.copy(default_settings, personal_settings)
    print("Created new personal settings file, this probably needs to be edited before proceeding.")
with open(personal_settings, mode = 'r') as file:
    user_settings = {}
    for line in csv.DictReader(file):
        user_settings[line['setting']] = line['value']
db_uri = "bolt://localhost:" + str(user_settings['port_number'])
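
The cell above expects a CSV file with a `setting` column and a `value` column. As a minimal sketch of that parsing, the snippet below runs the same logic against an in-memory file. The values are placeholders chosen for illustration and are not the contents of the repo's `default_settings.csv`.

```python
import csv
import io

# Hypothetical settings content; the real default_settings.csv may differ.
example_csv = io.StringIO(
    "setting,value\n"
    "port_number,7687\n"
    "username,neo4j\n"
    "password,change_me\n"
    "db_name,neo4j\n"
)

# Same parsing logic as the notebook cell, written as a dict comprehension.
example_settings = {row["setting"]: row["value"] for row in csv.DictReader(example_csv)}

# All values arrive as strings, so the port can be concatenated directly.
example_uri = "bolt://localhost:" + example_settings["port_number"]
print(example_uri)
```

Because every value is read as a string, any numeric setting added later would need an explicit conversion before use.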

In [None]:
#data path
data_root = os.path.join("..","data")
data_david = os.path.join(data_root,"extracts","david")

In [None]:
db_connection = GraphDatabase.driver(db_uri, auth=(user_settings['username'],user_settings['password']))

In [None]:
db_session = db_connection.session(database=user_settings['db_name'])

In [None]:
#possibly superfluous helper function
def df_oneliner(df):
    return f"The dataset contains {df.shape[0]} records."

In [None]:
#check that country and country codes are matching correctly
def nl_country_code_checker(database_session, node_type):
    #query building blocks
    query_start = "MATCH (n:" + node_type + ") WHERE "
    query_end = " RETURN COUNT(n)"
    query_country_codes = "n.country_codes CONTAINS 'NLD'"
    query_countries = "n.countries CONTAINS 'Netherlands'"
    #test queries
    query = query_start + query_country_codes + query_end
    result = database_session.run(query)
    code_count = result.value()[0]    
    query = query_start + query_countries + query_end
    result = database_session.run(query)
    country_count = result.value()[0]
    query = query_start + "(" + query_country_codes + " AND " + query_countries + ")" + query_end
    result = database_session.run(query)
    code_country_count = result.value()[0]
    #check results
    if (code_count==country_count and code_count==code_country_count):
        print(f"{code_count} {node_type} entries found for NL; country name and code are applied consistently.")
        return True
    else:
        string_temp = f"Country name ({country_count}) and codes ({code_count}) not applied consistently"
        string_temp = string_temp + f" for NL {node_type} entries, with {code_country_count} entries having both."
        print(string_temp)
        return False

## Database Summary Statistics

Before looking at the Dutch-specific contents, it is useful to take an overall look at the database.

In [None]:
#get node types
query = "MATCH (n) WITH labels(n) as labels RETURN DISTINCT labels"
query_response = db_session.run(query)
node_types = pd.DataFrame([dict(record.data()) for record in query_response])
node_types

In [None]:
#get node frequency by type
query = "MATCH (n) RETURN COUNT(n), labels(n)"
query_response = db_session.run(query)
node_type_frequency = pd.DataFrame([dict(record.data()) for record in query_response])
node_type_frequency

In [None]:
#get all the property keys
query = "CALL db.propertyKeys()"
query_response = db_session.run(query)
property_keys = pd.DataFrame([dict(record.data()) for record in query_response])
property_keys.sort_values(by='propertyKey')

In [None]:
#get frequency by source (i.e. which leak the nodes are from)
query = "MATCH (n) RETURN COUNT(n), LEFT(n.sourceID,15)"
query_response = db_session.run(query)
node_source = pd.DataFrame([dict(record.data()) for record in query_response])
node_source

## DB Cleaning

The summary statistics above show that there are some issues with the database. In this section, we investigate those issues and try to correct them so that they do not complicate subsequent queries. Note that the corrections change the actual database, so the statistics found below will differ the second time they are run.

### Investigate Issues

Here we look into the various 'country' property keys and how they are used.

In [None]:
#investigate country vs countries
query = "MATCH (n) WHERE EXISTS(n.country) RETURN COUNT(n)"
query_response = db_session.run(query)
country_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {country_count} nodes with the country property specified."
print(string_temp)
query = "MATCH (n) WHERE EXISTS(n.countries) RETURN COUNT(n)"
query_response = db_session.run(query)
countries_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {countries_count} nodes with the countries property specified."
print(string_temp)
query = "MATCH (n) WHERE (EXISTS(n.country) AND EXISTS(n.countries)) RETURN COUNT(n)"
query_response = db_session.run(query)
country_countries_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {country_countries_count} nodes with both the country and countries properties specified."
print(string_temp)

In [None]:
#investigate country_code vs country_codes
query = "MATCH (n) WHERE EXISTS(n.country_code) RETURN COUNT(n)"
query_response = db_session.run(query)
country_code_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {country_code_count} nodes with the country_code property specified."
print(string_temp)
query = "MATCH (n) WHERE EXISTS(n.country_codes) RETURN COUNT(n)"
query_response = db_session.run(query)
country_codes_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {country_codes_count} nodes with the country_codes property specified."
print(string_temp)
query = "MATCH (n) WHERE (EXISTS(n.country_code) AND EXISTS(n.country_codes)) RETURN COUNT(n)"
query_response = db_session.run(query)
country_code_codes_count = query_response.data(0)[0]['COUNT(n)']
string_temp = f"There are {country_code_codes_count} nodes with both the country_code and country_codes properties specified."
print(string_temp)

In [None]:
#double check that the singular naming is used for country and country code together
query = "MATCH (n) WHERE (EXISTS(n.country) AND EXISTS(n.country_code)) RETURN COUNT(n)"
query_response = db_session.run(query)
singular_count = pd.DataFrame([dict(record.data()) for record in query_response])
singular_count

In [None]:
#check source of the singular country property key
query = "MATCH (n) WHERE (EXISTS(n.country) AND EXISTS(n.country_code)) RETURN COUNT(n), LEFT(n.sourceID,15)"
query_response = db_session.run(query)
singular = pd.DataFrame([dict(record.data()) for record in query_response])
singular

In [None]:
query = "MATCH (n) WHERE EXISTS(n.countries) RETURN COUNT(n), LEFT(n.sourceID,15)"
query_response = db_session.run(query)
plural = pd.DataFrame([dict(record.data()) for record in query_response])
plural

In [None]:
#check whether any node uses the misspelled 'coutnries' property key
query = "MATCH (n) WHERE EXISTS(n.coutnries) RETURN COUNT(n)"
query_response = db_session.run(query)
coutnries = pd.DataFrame([dict(record.data()) for record in query_response])
coutnries

### Fix Issues

Note that the findings mentioned here are only correct with respect to an unmodified copy of the database.
Once these fixes have been run, the results above will of course change.
However, all the _fixes_ should be additive (adding new properties) rather than deleting anything.
In this way, genuinely breaking changes are hopefully avoided.

Although the list of property keys includes a misspelling of _countries_ (_coutnries_), we don't find any nodes using it.
Cypher apparently lacks good support for removing unused property keys from the database.
This suggests the key is only present because the typo was once made, which would explain why it is not in use in the current database.
Furthermore, it suggests that explicitly removing it is not worth the effort.

For the singular vs plural issue around countries and country codes, we see that the two naming conventions are used disjointly.
It seems that the Paradise Papers use the singular naming for cases where only one country is involved.
There is some logic to that, but it complicates queries, which then have to check two property names.
To simplify life, we will map everything to the plural naming convention.
Note that after doing this, the comparison with David's data search will need to be re-checked.
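
The mapping described here can be sketched in plain Python, with dictionaries standing in for database nodes. This is only an illustration of the intended rule (copy the singular values to the plural names, keep the originals, and skip nodes that already have the plural property). The node contents and the `;` separator are hypothetical.

```python
# Hypothetical property maps standing in for database nodes.
nodes = [
    {"name": "A", "country": "Netherlands", "country_code": "NLD"},
    {"name": "B", "countries": "Netherlands;Aruba", "country_codes": "NLD;ABW"},
    {"name": "C"},
]

def add_plural_properties(node):
    """Copy singular country properties to the plural names, keeping the originals."""
    updated = dict(node)
    if "country" in updated and "countries" not in updated:
        updated["countries"] = updated["country"]
        # A missing country_code simply stays missing on the plural side too.
        if "country_code" in updated:
            updated["country_codes"] = updated["country_code"]
    return updated

migrated = [add_plural_properties(node) for node in nodes]
# Running the mapping a second time changes nothing, so the fix is repeatable.
migrated_again = [add_plural_properties(node) for node in migrated]
```

Since nothing is deleted, queries that still rely on the singular names keep working after the mapping.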

The sourceID property breaks down the source (by leak) to a level of granularity which I find unhelpful.
In this section, we will also add a new property called _leak_ that specifies the source more simply.
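
As a rough sketch of what the simpler _leak_ value looks like, the snippet below applies the same 15-character prefix that the notebook uses in Cypher (`LEFT(n.sourceID,15)`). The sourceID strings are made up for illustration and are not taken from the database.

```python
LEAK_PREFIX_LENGTH = 15  # same length as LEFT(n.sourceID,15) in the notebook

def leak_from_source_id(source_id):
    """Return the first 15 characters of sourceID, or None if it is missing."""
    if source_id is None:
        return None
    return source_id[:LEAK_PREFIX_LENGTH]

# Hypothetical sourceID values, for illustration only.
examples = [
    "Paradise Papers - Example registry A",
    "Paradise Papers - Example registry B",
    "Panama Papers",
]
leaks = sorted({leak_from_source_id(s) for s in examples})
print(leaks)
```

Values shorter than 15 characters pass through unchanged, while longer ones collapse onto their shared prefix.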

In [None]:
#Neo4j doesn't seem to do batching to avoid crashes, so we do it manually
batch_size = 100000  #on my machine, this size gives execution per batch in a couple of seconds rather than crashing
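
The manual batching used in the cells below follows a do-while pattern: run one batch, then stop once a batch updates nothing. A minimal sketch of that pattern without a database is given here. `run_in_batches` and `fake_update` are hypothetical names, and the list stands in for the nodes that still need updating.

```python
def run_in_batches(process_batch, batch_size):
    """Call process_batch(batch_size) until it reports that nothing was updated.

    process_batch is a hypothetical callable that updates at most batch_size
    items and returns how many it actually updated.
    """
    batches = 0
    total = 0
    while True:  # do-while: always run once, then test the exit condition
        updated = process_batch(batch_size)
        if updated == 0:
            break
        batches += 1
        total += updated
    return batches, total

# Stand-in for the database: 250 items that still need the new property.
pending = list(range(250))

def fake_update(limit):
    """Remove up to `limit` items from the pending list and return the count."""
    done = pending[:limit]
    del pending[:limit]
    return len(done)

result = run_in_batches(fake_update, 100)
print(result)
```

The loop only terminates because each batch removes items from the set still to be processed, which is the role played by the `NOT EXISTS(...)` condition in the Cypher queries.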

In [None]:
#create leak column (like sourceID but simpler)
batch_number = 0
while True:  #Python has no explicit do-while, but this achieves the same
    query = "MATCH (n) WHERE (EXISTS(n.sourceID) AND NOT EXISTS(n.leak)) WITH n LIMIT " + str(batch_size)
    query = query + " SET n.leak = LEFT(n.sourceID,15) RETURN COUNT(n), n.leak"
    query_response = db_session.run(query)
    leak_source = pd.DataFrame([dict(record.data()) for record in query_response])
    batch_number = batch_number+1
    print(f"Batch {batch_number} of size {batch_size} completed.")
    if(leak_source.empty):
        break
query = "MATCH (n) RETURN COUNT(n), n.leak"
query_response = db_session.run(query)
leak_source = pd.DataFrame([dict(record.data()) for record in query_response])
leak_source

In [None]:
#merge singular country and country_code columns to plural
batch_number = 0
while True:  #Python has no explicit do-while, but this achieves the same
    query = "MATCH (n) WHERE (EXISTS(n.country) AND NOT EXISTS(n.countries)) WITH n LIMIT " + str(batch_size)
    query = query + " SET n.countries = n.country, n.country_codes = n.country_code RETURN COUNT(n)"
    query_response = db_session.run(query)
    country_plural = pd.DataFrame([dict(record.data()) for record in query_response])
    batch_number = batch_number+1
    print(f"Batch {batch_number} of size {batch_size} completed.")
    if country_plural.loc[0,'COUNT(n)']==0:
        break      
query = "MATCH (n) WHERE (EXISTS(n.country) AND NOT EXISTS(n.countries)) RETURN COUNT(n)"
query_response = db_session.run(query)
country_singular = pd.DataFrame([dict(record.data()) for record in query_response])
country_singular

## Find Dutch Addresses

In the dataset there are various country identifiers.
For entities, there are at least four relevant fields: 'country_codes', 'countries', 'jurisdiction_description' and 'address'.
For officers, only the two country fields appear to be present. We start by checking whether the fields for country codes and countries are consistent with each other.

In [None]:
#quick checks
check_one = nl_country_code_checker(db_session,'Entity')
check_two = nl_country_code_checker(db_session,'Officer')
check = check_one and check_two
check

Note that the above quick checks do not capture situations where neither the country name nor the country code is provided as expected (e.g. the code _NL_ with country name _Nederland_ would be ignored and excluded). However, they do still tell us something about whether we can trust the codes and country naming. If the code query returns more results than the name query, and the name query returns the same number as the combined query, we get the option to query more broadly (on country code) or more narrowly (on country name).
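
The reasoning about broader and narrower queries can be written down as a small decision function. This is a sketch of one way to read the three counts, not part of the notebook's own checks. The function name and the counts used in the example are hypothetical.

```python
def interpret_counts(code_count, name_count, both_count):
    """Classify how the code-based and name-based counts relate."""
    if code_count == name_count == both_count:
        return "consistent"
    if name_count == both_count and code_count > name_count:
        # Every named node also has the code, but some coded nodes lack the name.
        return "code query is broader"
    if code_count == both_count and name_count > code_count:
        # Every coded node also has the name, but some named nodes lack the code.
        return "name query is broader"
    return "partial overlap"

# Hypothetical counts, for illustration only.
print(interpret_counts(120, 100, 100))
```

In the "partial overlap" case neither query contains the other, so a composite search would need to combine both conditions.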

In [None]:
#look at officers 'mismatch'
query = "MATCH (n:Officer) WHERE (n.country_codes CONTAINS 'NLD' AND NOT n.countries CONTAINS 'Netherlands') RETURN n"
query_response = db_session.run(query)
officers_mismatch_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(officers_mismatch_nl)

In [None]:
officers_mismatch_nl

In [None]:
#look at entities
query = "MATCH (n:Entity) WHERE n.country_codes CONTAINS 'NLD' RETURN n"
query_response = db_session.run(query)
entities_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(entities_nl)

In [None]:
entities_nl

In [None]:
#look at officers
query = "MATCH (n:Officer) WHERE n.country_codes CONTAINS 'NLD' RETURN n"
query_response = db_session.run(query)
officers_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(officers_nl)

In [None]:
officers_nl

## Load David's Results

David separately extracted data on entities and officers based in the Netherlands.
These are saved as csv files, but we also now have the explicit query strings.

In [None]:
#file names
entities_file_david = "entities_nl_address.csv"
officers_file_david = "officers_nl_address.csv"
#queries
query_entity_david = "MATCH (a:Address {countries: 'Netherlands'})-[rel:registered_address]-(e:Entity) RETURN e.name, a.address"
query_officer_david = "MATCH (a:Address {countries: 'Netherlands'})-[rel:registered_address]-(o:Officer) RETURN o.name, a.address"

In [None]:
#rerun queries
query_response = db_session.run(query_entity_david)
entities_david_reload = pd.DataFrame([dict(record.data()) for record in query_response])
print(df_oneliner(entities_david_reload))
query_response = db_session.run(query_officer_david)
officers_david_reload = pd.DataFrame([dict(record.data()) for record in query_response])
print(df_oneliner(officers_david_reload))

In [None]:
#saved data
entities_david_saved = pd.read_csv(os.path.join(data_david,entities_file_david))
print(df_oneliner(entities_david_saved))
officers_david_saved = pd.read_csv(os.path.join(data_david,officers_file_david))
print(df_oneliner(officers_david_saved))
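
One possible way to compare the saved files with the rerun queries is to reduce both to sets of (name, address) pairs and look at the differences. The sketch below uses plain Python and in-memory CSV text. The column names `e.name` and `a.address` are assumed to match the query's return values, the saved files may use different headers, and the rows are invented.

```python
import csv
import io

def load_pairs(handle, name_column, address_column):
    """Read (name, address) pairs into a set, ignoring case and outer whitespace."""
    return {
        (row[name_column].strip().lower(), row[address_column].strip().lower())
        for row in csv.DictReader(handle)
    }

# Hypothetical in-memory stand-ins for the saved file and the rerun query.
saved = io.StringIO(
    "e.name,a.address\nAlpha BV,Street 1 Amsterdam\nBeta NV,Road 2 Utrecht\n"
)
rerun = io.StringIO(
    "e.name,a.address\nalpha bv,Street 1 Amsterdam\nGamma BV,Lane 3 Rotterdam\n"
)

saved_pairs = load_pairs(saved, "e.name", "a.address")
rerun_pairs = load_pairs(rerun, "e.name", "a.address")

in_both = saved_pairs & rerun_pairs      # present in both sources
only_saved = saved_pairs - rerun_pairs   # present only in the saved file
only_rerun = rerun_pairs - saved_pairs   # present only in the rerun query
print(len(in_both), len(only_saved), len(only_rerun))
```

Sets discard duplicate rows, so this answers which records differ rather than how often each record occurs.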

In [None]:
#TODO: compare data with David's explicitly
#TODO: adjust David's queries to work with country code
#TODO: make composite search that finds all with Dutch connection
#TODO: make 2nd generation (or further) matching based on full Dutch datasets
#TODO: have summary statistics to compare prevalence of NL in the datasets (ideally by dataset)
#TODO: make a setting to open read only by default so that people will not update the DB unless they explicitly want to
#TODO: add batch size as general setting to the settings file