# General Dutch Survey

This notebook searches the Offshore Leaks database for any Dutch references.

## Preliminaries

Please note the general set-up requirements described in the repo readme and the _showcase_ notebook.
Remember that the database needs to be running locally for this notebook to work.

In [None]:
#imports
import os  #to find the settings file(s)
import csv #to process the settings file(s)
import shutil #to copy the settings file (if needed)
from neo4j import GraphDatabase
import pandas as pd

In [None]:
#get settings
settings_dir = os.path.join("..","settings")
personal_settings = os.path.join(settings_dir,"personal_settings.csv")
if not os.path.exists(personal_settings):
    default_settings = os.path.join(settings_dir,"default_settings.csv")
    shutil.copy(default_settings, personal_settings)
    print("Created new personal settings file, this probably needs to be edited before proceeding.")
with open(personal_settings, mode = 'r') as file:
    user_settings = {}
    for line in csv.DictReader(file):
        user_settings[line['setting']] = line['value']
db_uri = "bolt://localhost:" + str(user_settings['port_number'])

In [None]:
#data path
data_root = os.path.join("..","data")
data_david = os.path.join(data_root,"extracts","david")

In [None]:
db_connection = GraphDatabase.driver(db_uri, auth=(user_settings['username'],user_settings['password']))

In [None]:
db_session = db_connection.session(database=user_settings['db_name'])

In [None]:
#possibly superfluous helper function
def df_oneliner(df):
    return f"The dataset contains {df.shape[0]} records."

In [None]:
#check that country names and country codes match correctly
def nl_country_code_checker(database_session, node_type):
    #query building blocks
    query_start = "MATCH (n:" + node_type + ") WHERE "
    query_end = " RETURN COUNT(n)"
    query_country_codes = "n.country_codes CONTAINS 'NLD'"
    query_countries = "n.countries CONTAINS 'Netherlands'"
    #test queries
    query = query_start + query_country_codes + query_end
    result = database_session.run(query)
    code_count = result.value()[0]    
    query = query_start + query_countries + query_end
    result = database_session.run(query)
    country_count = result.value()[0]
    query = query_start + "(" + query_country_codes + " AND " + query_countries + ")" + query_end
    result = database_session.run(query)
    code_country_count = result.value()[0]
    #check results
    if (code_count==country_count and code_count==code_country_count):
        print(f"{code_count} {node_type} entries found for NL; country name and code are applied consistently.")
        return True
    else:
        string_temp = f"Country name ({country_count}) and codes ({code_count}) not applied consistently"
        string_temp = string_temp + f" for NL {node_type} entries, with {code_country_count} entries having both."
        print(string_temp)
        return False

## Find Dutch Addresses

In the dataset there are various country identifiers.
For entities, there are at least four relevant fields: _country_codes_, _countries_, _jurisdiction_description_ and _address_.
For officers, only the two country fields appear to be present. We start by checking whether the fields for country codes and countries are consistent with each other.

In [None]:
#quick checks
check_one = nl_country_code_checker(db_session,'Entity')
check_two = nl_country_code_checker(db_session,'Officer')
check = check_one and check_two
check

Note that the above quick checks do not capture situations where neither the country name nor the country code is provided as expected (e.g. the code _NL_ with country name _Nederland_ would be ignored and excluded). However, they do still tell us something about whether we can trust the codes and country naming. If the country code query returns more results than the country name query, and the name query returns the same number as the combined query, we have the option to query more broadly (on country code) or more narrowly (on country name).
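The logic of the three counts can be illustrated without a database connection. The following is a minimal sketch in plain Python; the sample records, the `;` separator and the function name `count_nl_matches` are assumptions made for illustration and are not taken from the database.

```python
# Hypothetical sample records; the values are made up for illustration only.
sample_nodes = [
    {"country_codes": "NLD", "countries": "Netherlands"},
    {"country_codes": "NLD;GBR", "countries": "Netherlands;United Kingdom"},
    {"country_codes": "NLD", "countries": None},        # code without a name
    {"country_codes": "NL", "countries": "Nederland"},  # missed by both checks
]

def count_nl_matches(nodes, code="NLD", name="Netherlands"):
    """Return (code_count, name_count, both_count) using substring matching."""
    # Treat missing values as empty strings so they simply never match.
    has_code = [code in (n.get("country_codes") or "") for n in nodes]
    has_name = [name in (n.get("countries") or "") for n in nodes]
    code_count = sum(has_code)
    name_count = sum(has_name)
    both_count = sum(c and m for c, m in zip(has_code, has_name))
    return code_count, name_count, both_count

code_count, name_count, both_count = count_nl_matches(sample_nodes)
print(code_count, name_count, both_count)
```

In this toy data the code count exceeds the name count, and the name count equals the combined count, so querying on the code is the broader option. The last record shows the blind spot: it is Dutch but matches neither check.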

In [None]:
#look at officers 'mismatch'
query = "MATCH (n:Officer) WHERE (n.country_codes CONTAINS 'NLD' AND NOT n.countries CONTAINS 'Netherlands') RETURN n"
query_response = db_session.run(query)
officers_mismatch_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(officers_mismatch_nl)

In [None]:
officers_mismatch_nl

In [None]:
#look at entities
query = "MATCH (n:Entity) WHERE n.country_codes CONTAINS 'NLD' RETURN n"
query_response = db_session.run(query)
entities_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(entities_nl)

In [None]:
entities_nl

In [None]:
#look at officers
query = "MATCH (n:Officer) WHERE n.country_codes CONTAINS 'NLD' RETURN n"
query_response = db_session.run(query)
officers_nl = pd.DataFrame([dict(record.data()['n']) for record in query_response])
df_oneliner(officers_nl)

In [None]:
officers_nl

## Load David's Results

David separately extracted data on entities and officers based in the Netherlands.

In [None]:
#file names
entities_file_david = "entities_nl_address.csv"
officers_file_david = "officers_nl_address.csv"

In [None]:
#entities david
entities_david = pd.read_csv(os.path.join(data_david,entities_file_david))
df_oneliner(entities_david)

In [None]:
#officers david
officers_david = pd.read_csv(os.path.join(data_david,officers_file_david))
df_oneliner(officers_david)

In [None]:
#TODO: compare data with David's explicitly
#TODO: make composite search that finds all with Dutch connection
#TODO: make 2nd generation (or further) matching based on full Dutch datasets
#TODO: have summary statistics to compare prevalence of NL in the datasets (ideally by dataset)
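The first TODO above could start from a plain set comparison of record identifiers. The following is only a sketch: the helper name `compare_ids` is hypothetical, and the identifiers are made up for illustration.

```python
def compare_ids(ours, theirs):
    """Split two collections of identifiers into shared and one-sided sets."""
    ours, theirs = set(ours), set(theirs)
    return {
        "both": ours & theirs,         # present in both extracts
        "only_ours": ours - theirs,    # found by our query only
        "only_theirs": theirs - ours,  # found in the other extract only
    }

# Made-up identifiers, for illustration only.
overlap = compare_ids([101, 102, 103], [102, 103, 104])
for name, ids in overlap.items():
    print(name, sorted(ids))
```

In this notebook the call might look like `compare_ids(entities_nl["node_id"], entities_david["node_id"])`, assuming both dataframes expose a `node_id` column, which has not been verified for David's files.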