# Dataset Documentation


Types of connections:
* Shared entity ownership/directorship - Officers connected through common companies or entities
* Intermediary relationships - Officers using the same service providers/law firms
* Address connections - Officers sharing registered addresses
* Temporal connections - Officers establishing entities in similar timeframes
* Jurisdictional connections - Officers operating in the same offshore jurisdictions

Find out how we parameterize this network - what measures are important. It's likely that we're interested in the intersection of intermediaries and countries. E.g. conditinoing on some given country, what's the network density of the bipartite graph (projected down) of different countries?
E.g. we have the following simpliefied model of types:
* People; Intermediaries; Country.

We condition on Country, to get a bipartite graph that's consisting of People and intermediaries. Project taht down so people are considered connected if they use the same intermediary

Comparing all of them to what we'd expect ifor a given Erdos-Renyi model (where given n nodes, we model each edge as Bernouilli(p) of whether it exists. Notice, that we have edges proportional to the number of nodes, hence a constant density. We should probably control for osomething like size of countries - e.g. this woud be if supply-side models arae correct , and they optimise on th enumber of lients, then we'd have a constant number of clients)

Bender and Canfield, 1978: Configuraiton models
Here we address problem of degree distribution by configuring it for a give ndegree distributino taken exogenously. Centrality and so on still work as a means of comparison in that case! Don't know about null statistics - we can likely generate some vairance for a given node, and use that paired with an assumption of normality/

# Prereqs

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import wbdata
import plotly.express as px
import numpy as np
from datetime import datetime

In [3]:
path_to_dataset = Path("datasets/icij/FullDatasetLong")

officers_path = path_to_dataset / "nodes-officers.csv"
intermediaries_path = path_to_dataset / "nodes-intermediaries.csv"

df_officers = pd.read_csv(officers_path)
df_intermediaries = pd.read_csv(intermediaries_path)

# Exploratory Analysis

In [7]:
df_intermediaries.head()

Unnamed: 0,node_id,name,status,internal_id,address,countries,country_codes,sourceID,valid_until,note
0,11000001,"MICHAEL PAPAGEORGE, MR.",ACTIVE,10001,MICHAEL PAPAGEORGE; MR. 106 NICHOLSON STREET B...,South Africa,ZAF,Panama Papers,The Panama Papers data is current through 2015,
1,11000002,CORFIDUCIA ANSTALT,ACTIVE,10004,,Liechtenstein,LIE,Panama Papers,The Panama Papers data is current through 2015,
2,11000003,"DAVID, RONALD",SUSPENDED,10014,,Monaco,MCO,Panama Papers,The Panama Papers data is current through 2015,
3,11000004,"DE BOUTSELIS, JEAN-PIERRE",SUSPENDED,10015,,Belgium,BEL,Panama Papers,The Panama Papers data is current through 2015,
4,11000005,THE LEVANT LAWYERS (TLL),ACTIVE,10029,,Lebanon,LBN,Panama Papers,The Panama Papers data is current through 2015,


In [9]:
df_intermediaries.groupby('sourceID').size().sort_values(ascending=False)

sourceID
Panama Papers                                                14110
Offshore Leaks                                                9526
Pandora Papers - Alemán, Cordero, Galindo & Lee (Alcogal)     1023
Paradise Papers - Barbados corporate registry                  974
Bahamas Leaks                                                  541
Paradise Papers - Bahamas corporate registry                   239
Paradise Papers - Appleby                                      185
Paradise Papers - Nevis corporate registry                      96
Paradise Papers - Aruba corporate registry                      74
dtype: int64

In [10]:
df_intermediaries.groupby(['sourceID', 'country_codes']).size().sort_values(ascending=False)

sourceID        country_codes
Offshore Leaks  HKG              2684
Panama Papers   HKG              2202
                GBR              1367
Offshore Leaks  TWN              1274
                SGP              1243
                                 ... 
Panama Papers   OMN                 1
                GBR;ARE             1
                FRA;MCO             1
Offshore Leaks  XXX;IND             1
                AIA                 1
Length: 403, dtype: int64

In [6]:
df_officers.head()

Unnamed: 0,node_id,name,countries,country_codes,sourceID,valid_until,note
0,12000001,KIM SOO IN,South Korea,KOR,Panama Papers,The Panama Papers data is current through 2015,
1,12000002,Tian Yuan,China,CHN,Panama Papers,The Panama Papers data is current through 2015,
2,12000003,GREGORY JOHN SOLOMON,Australia,AUS,Panama Papers,The Panama Papers data is current through 2015,
3,12000004,MATSUDA MASUMI,Japan,JPN,Panama Papers,The Panama Papers data is current through 2015,
4,12000005,HO THUY NGA,Viet Nam,VNM,Panama Papers,The Panama Papers data is current through 2015,


In [4]:
df_officers['countries'].value_counts()

countries
Malta                                    45042
Not identified                           39450
China                                    38275
Hong Kong                                30226
United States                            26958
                                         ...  
Australia;Germany;Netherlands                1
Jersey;Portugal                              1
Germany;Singapore;United Kingdom             1
Australia;South Africa;United Kingdom        1
Turkmenistan;Russian Federation              1
Name: count, Length: 4090, dtype: int64

In [4]:
import plotly.express as px
import numpy as np

countries_split = df_officers['countries'].str.split(';').explode()
country_counts = countries_split.value_counts()

fig = px.choropleth(
    locations=country_counts.index,
    locationmode='country names',
    color=np.log10(country_counts.values),  # Apply log transformation
    color_continuous_scale='Viridis',
    title='Number of Officers by Country (Log Scale)',
    labels={'color': 'Log10(Number of Officers)'}
)

# Update the layout
fig.update_layout(
    title_x=0.5,
    geo=dict(showframe=False, showcoastlines=True, projection_type='equirectangular'),
    width=1000,
    height=600
)

fig.show()

In [5]:
country_counts.index.unique()

Index(['Malta', 'China', 'Not identified', 'Hong Kong', 'United States',
       'United Kingdom', 'Taiwan', 'British Virgin Islands', 'Italy',
       'Switzerland',
       ...
       'Channel Islands', 'Tuvalu', 'Korea, Republic of', 'Mayotte',
       'French Southern Territories', 'Greenland', 'Bhutan',
       'Sao Tome and Principe', 'Eswatini', ' Croatia'],
      dtype='object', name='countries', length=268)

In [6]:
import wbdata
import pandas as pd
import plotly.express as px
import numpy as np
from datetime import datetime

# Split and count countries as before
countries_split = df_officers['countries'].str.split(';').explode()
country_counts = countries_split.value_counts()

# Get population data
population_indicator = 'SP.POP.TOTL'  # Total population indicator
date = datetime(2015, 1, 1)  # Using 2015 as reference year

# Fetch population data
pop_data = wbdata.get_dataframe({population_indicator: 'Population'})
pop_data = pop_data.reset_index()

# Create a mapping dictionary for country names (you might need to adjust some names)
country_name_fixes = {
    'United States': 'United States of America',
    'UK': 'United Kingdom',
    # Add more mappings if needed
}

# Apply name fixes
countries_fixed = countries_split.map(lambda x: country_name_fixes.get(x, x))
country_counts_fixed = countries_fixed.value_counts()

# Merge population data with officer counts
merged_data = pd.DataFrame({
    'country': country_counts_fixed.index,
    'officers': country_counts_fixed.values
})
merged_data = merged_data.merge(pop_data, left_on='country', 
                              right_on='country', how='left')

# Calculate officers per million
merged_data['officers_per_million'] = (merged_data['officers'] / 
                                     merged_data['Population']) * 1_000_000


merged_data_filtered = merged_data[merged_data['officers_per_million'] < 300]

fig = px.choropleth(
    merged_data_filtered,
    locations='country',
    locationmode='country names',
    color=merged_data_filtered['officers_per_million'].fillna(0),
    # color=np.log10(merged_data['officers_per_million'].fillna(0) + 1),  # Add 1 to handle zeros
    color_continuous_scale='Viridis_r',
    title='nodes_officers per Million (WB, 2015)',
    labels={'color': 'nodes_officers per Million'}
)

fig.update_layout(
    title_x=0.5,
    geo=dict(showframe=False, showcoastlines=True, projection_type='equirectangular'),
    width=1000,
    height=600
)

fig.show()

print("\nTop 15 countries by officers per million inhabitants:")
print(merged_data_filtered.sort_values('officers_per_million', ascending=False)
      .head(15)[['country', 'officers_per_million', 'officers', 'Population']]
      .round(2))


Top 15 countries by officers per million inhabitants:
          country  officers_per_million  officers  Population
4948      Estonia                299.94       411   1370286.0
4968      Estonia                299.84       411   1370720.0
9238        Tonga                299.82        29     96725.0
4628       Latvia                298.88       572   1913822.0
3437       Norway                298.85      1258   4209488.0
5000      Estonia                298.48       411   1376955.0
3282  New Zealand                298.46      1348   4516500.0
9237        Tonga                298.34        29     97204.0
1807       Sweden                298.09      2921   9799186.0
4969      Estonia                297.97       411   1379350.0
3436       Norway                297.62      1258   4226901.0
1331     Malaysia                297.44      3756  12627862.0
6742         Fiji                297.30       128    430536.0
9236        Tonga                297.30        29     97544.0
3223      Leban

In [7]:
merged_data.shape

(11860, 5)

In [8]:
# Define Global South countries (this is a simplified list, you might want to adjust it)
global_south = [
    'Afghanistan', 'Algeria', 'Angola', 'Argentina', 'Bangladesh', 'Benin', 
    'Bolivia', 'Botswana', 'Brazil', 'Burkina Faso', 'Burundi', 'Cambodia', 
    'Cameroon', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 
    'Congo', 'Costa Rica', 'Cuba', 'DR Congo', 'Dominican Republic', 'Ecuador', 
    'Egypt', 'El Salvador', 'Ethiopia', 'Gabon', 'Ghana', 'Guatemala', 'Guinea', 
    'Haiti', 'Honduras', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ivory Coast', 
    'Jamaica', 'Jordan', 'Kenya', 'Laos', 'Lebanon', 'Lesotho', 'Liberia', 
    'Libya', 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania', 'Mexico', 
    'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Nicaragua', 'Niger', 
    'Nigeria', 'Pakistan', 'Papua New Guinea', 'Paraguay', 'Peru', 
    'Philippines', 'Rwanda', 'Saudi Arabia', 'Senegal', 'Sierra Leone', 'Somalia', 
    'South Africa', 'South Sudan', 'Sri Lanka', 'Sudan', 'Syria', 'Tanzania', 
    'Thailand', 'Togo', 'Tunisia', 'Uganda', 'Uruguay', 'Venezuela', 'Vietnam', 
    'Yemen', 'Zambia', 'Zimbabwe'
]

# Filter the merged data for Global South countries
global_south_data = merged_data[merged_data['country'].isin(global_south)]

# Create choropleth map for Global South
fig = px.choropleth(
    global_south_data,
    locations='country',
    locationmode='country names',
    color=np.log10(global_south_data['officers_per_million'].fillna(0) + 1),
    color_continuous_scale='Viridis',
    title='Log (base-10) nodes_officers per Million (WB tot. pop., 2015)',
    labels={'color': 'Log(10) nodes_officers per Million'}
)

# Update the layout to focus on Global South
fig.update_layout(
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
        # Adjust the map center and zoom to focus on Global South
        center=dict(lat=0, lon=20),
        projection_scale=1.5
    ),
    width=1000,
    height=600
)

fig.show()

# Print statistics for Global South countries
print("\nTop 15 Global South countries by officers per million inhabitants:")
print(global_south_data.sort_values('officers_per_million', ascending=False)
      .head(15)[['country', 'officers_per_million', 'officers', 'Population']]
      .round(2))

# Print some summary statistics
print("\nSummary statistics for Global South countries:")
print(f"Total number of countries with data: {len(global_south_data)}")
print(f"Total number of officers: {global_south_data['officers'].sum():,}")
print(f"Average officers per million: {global_south_data['officers_per_million'].mean():.2f}")
fig.write_image("images/global_south_officers_custom.jpg", width=1920, height=1080)


Top 15 Global South countries by officers per million inhabitants:
     country  officers_per_million  officers  Population
1222   Libya               2628.46      3924   1492890.0
1221   Libya               2522.34      3924   1555699.0
1220   Libya               2419.30      3924   1621960.0
1219   Libya               2321.27      3924   1690457.0
1218   Libya               2237.57      3924   1753691.0
1217   Libya               2175.15      3924   1804015.0
1216   Libya               2122.16      3924   1849063.0
1215   Libya               2067.22      3924   1898199.0
1214   Libya               2009.44      3924   1952781.0
1213   Libya               1948.08      3924   2014293.0
1212   Libya               1884.19      3924   2082589.0
1211   Libya               1822.12      3924   2153534.0
1210   Libya               1762.73      3924   2226091.0
1209   Libya               1695.62      3924   2314193.0
1208   Libya               1613.35      3924   2432211.0

Summary statistics 