# Business Network Study

In this notebook, I look for ways to simplify the current cleaned network, which has multiple edge types, into a network with a single canonical edge type, or perhaps one with only a few edge types.

In [4]:
#imports
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

#constants
%matplotlib inline
sns.set_style("dark")
sigLev = 3
figWidth = figHeight = 8

Let us load ``../data/processed/cleanedNetwork.pkl`` to build the "does business with" network. Building it requires careful reduction and simplification of both node types and edge types.

In [5]:
complicatedNet = nx.read_gpickle("../data/processed/cleanedNetwork.pkl")

In [6]:
#get some quick metrics on this network
numNodes = len(complicatedNet.nodes())
numEdges = len(complicatedNet.edges())

We see that there are {{numNodes}} nodes and {{numEdges}} edges in this network. Note that this network is currently directed, and thus represents a set of directed relationships. It also contains 4 different types of agents, as outlined in the [ICIJ terms and definitions section for the Offshore Leaks Database](https://offshoreleaks.icij.org/pages/about#terms_definition). Given the many types of agents and relationships in this network, we need to spend some time working out how to simplify it.
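To see how the agent types are distributed before simplifying, one option is to tally the `entType` node attribute. The following is a minimal sketch on a toy graph rather than on `complicatedNet` itself; the helper name `countNodeTypes`, the toy node IDs, and every type and relationship label other than `"Addresses"` and `"registered address"` are assumptions made for illustration.

```python
from collections import Counter

import networkx as nx

# Toy directed graph standing in for complicatedNet (hypothetical data).
toyNet = nx.DiGraph()
toyNet.add_node(1, entType="Officers")        # assumed label
toyNet.add_node(2, entType="Entities")        # assumed label
toyNet.add_node(3, entType="Addresses")       # label used in this notebook
toyNet.add_node(4, entType="Intermediaries")  # assumed label
toyNet.add_edge(1, 2, reltype="shareholder of")    # assumed label
toyNet.add_edge(2, 3, reltype="registered address")
toyNet.add_edge(4, 2, reltype="intermediary of")   # assumed label


def countNodeTypes(net, attrName="entType"):
    """Tally how many nodes carry each value of the given node attribute."""
    return Counter(attrs.get(attrName) for _, attrs in net.nodes(data=True))


nodeTypeCounts = countNodeTypes(toyNet)
print(nodeTypeCounts)
```

Calling `countNodeTypes(complicatedNet)` should give the same kind of tally for the real network, assuming every node carries an `entType` attribute.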

The key components we are interested in for our analysis are:

* Agents that represent meaningful stakeholders. This means that we should really be considering those people who are financially obligated in the network, and remove agents (such as addresses) that don't represent meaningful financial actors.

* We should remove relationships that do not involve fiscal or business obligations. Since we are interested in the social capital created by the use of tax havens among the Global 1%, it is essential that we capture the relationships that create that social capital. Relationships that do not represent fiscal or business obligations are not relevant when defining social capital within this network.

* We should look to collapse most of these relationships into a few simple categories. Keeping only one or two simple relationship types makes the metrics in our analysis much easier to interpret.

Let us begin our reductions and simplifications on this network.
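The three goals above can be sketched as a single pass over the network. This is a rough outline under stated assumptions, not the reduction this notebook settles on: `simplifyNetwork` is a hypothetical helper, the set of relationship types to keep is a placeholder, and the output is a plain `DiGraph`, so parallel edges between the same pair of nodes collapse into one.

```python
import networkx as nx


def simplifyNetwork(net, dropNodeTypes, keepRelTypes, canonicalLabel):
    """Copy net, dropping unwanted node types and relabelling kept edges."""
    simpleNet = nx.DiGraph()
    # keep only nodes whose type is not in the drop set
    for node, attrs in net.nodes(data=True):
        if attrs.get("entType") not in dropNodeTypes:
            simpleNet.add_node(node, **attrs)
    # keep only edges between surviving nodes with a retained relationship
    for fromNode, toNode, attrs in net.edges(data=True):
        bothKept = fromNode in simpleNet and toNode in simpleNet
        if bothKept and attrs.get("reltype") in keepRelTypes:
            simpleNet.add_edge(fromNode, toNode, reltype=canonicalLabel)
    return simpleNet


# Toy data (hypothetical); only "Addresses" and "registered address"
# are labels that appear in this notebook.
toyNet = nx.DiGraph()
toyNet.add_node(1, entType="Officers")
toyNet.add_node(2, entType="Entities")
toyNet.add_node(3, entType="Addresses")
toyNet.add_edge(1, 2, reltype="shareholder of")
toyNet.add_edge(2, 3, reltype="registered address")
toyNet.add_edge(1, 3, reltype="registered address")

simpleToyNet = simplifyNetwork(toyNet,
                               dropNodeTypes={"Addresses"},
                               keepRelTypes={"shareholder of"},
                               canonicalLabel="does business with")
print(list(simpleToyNet.edges(data=True)))
```

Which relationship types belong in `keepRelTypes` is exactly the question the rest of this notebook has to answer, so the value used here is only a stand-in.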

## Node Reduction

As discussed in our [Initial Analysis Notebook](initialAnalysis.ipynb), close to 15% of the nodes in this network are addresses (see Initial Analysis Notebook, Figure 1). I would argue that these nodes do not represent financial actors and instead serve as locational paper trails. That being said, before we remove them, let us see what kinds of relationships are carried by edges coming from address nodes and edges going to address nodes.

In [7]:
#get edges from addresses
fromAddressEdgeList = []
toAddressEdgeList = []
for edge in complicatedNet.edges(data = True):
    #get from and to nodes
    fromNode = edge[0]
    toNode = edge[1]
    #then get information on those nodes
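    #note: G.nodes[...] replaces the older G.node[...] alias, which is
    #no longer available in recent NetworkX releases
    fromNodeType = complicatedNet.nodes[fromNode]["entType"]
    toNodeType = complicatedNet.nodes[toNode]["entType"]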
    #check type
    if (fromNodeType == "Addresses"):
        fromAddressEdgeList.append(edge)
    if (toNodeType == "Addresses"):
        toAddressEdgeList.append(edge)

In [10]:
#function to get edge category frames
#taken from initialAnalysis.ipynb
def getEdgeInfo(edgeVec,keyName):
    #helper for returning a list of edge information over the whole list of
    #edges
    edgeInfoDict = {"edgeID":[],keyName:[]}
    for edgeTup in edgeVec:
        #0th entry is the source node's ID, used here as the edge identifier
        edgeInfoDict["edgeID"].append(edgeTup[0])
        #then get key info
        givenEdgeDict = edgeTup[2]
        edgeInfoDict[keyName].append(givenEdgeDict[keyName])
    return edgeInfoDict
#then get our edge type frames
fromAddressEdgeTypeFrame = pd.DataFrame(getEdgeInfo(fromAddressEdgeList,
                                                    "reltype"))
toAddressEdgeTypeFrame = pd.DataFrame(getEdgeInfo(toAddressEdgeList,"reltype"))

          edgeID             reltype
0       12159126  registered address
1       12217560  registered address
2       12220713  registered address
3       12133427  registered address
4       12220732  registered address
5       12191415  registered address
6       12131192  registered address
7       10044619  registered address
8       12153542  registered address
9       12158567  registered address
10      12159130  registered address
11      12203929  registered address
12      12133992  registered address
13      12153545  registered address
14      12159132  registered address
15      12153546  registered address
16      12133994  registered address
17      12133987  registered address
18      12153540  registered address
19      12133995  registered address
20      12153548  registered address
21      12133996  registered address
22      12153549  registered address
23      12159136  registered address
24      12133997  registered address
25      12164771  registered address
2