## Network Analysis

Video: https://ufile.io/nl4xsrgc

# Project Description
***

We would like to find businesses who are influential in drawing foreign and domestic investment in Myanmar.  Through basic network analysis techniques, we hope to find a clique of companies with common investment sources or common activities by referencing officers who are registered to more than one company.

# Data Description
***

This data was scraped and realeased anonymously from official government sources in two leaks called Myanmar Financials and Myanmar Investments.  The former is incorporation documents for ~125k companies, and the latter is information from investment proposals for about 10k companies.

# Known Challenges
***

1. About 1/4 of the companies in Myanmar Financials paid somebody to approve their incorporation documents without addresses or names... We can remove companies with no listed officers.

2. People in Myanmar are named for astrological information pertaining to their birth.  There is no family name, and many people have the same names.  with the possibility of 1 of 2 or no titles, and a conservative estimate of 40 possible name words, we can expect around 200,000 different name combinations.  Some will be more common than others, but this is not a huge problem for the scale and shape of our data.

3. A small number of names are given in Burmese script, which is included in UTF-8, but is unreadable to our team.  We can leave the script as-is, and remove punctuation.

# Method
***

1. Import and clean the data
2. Create an edge list:
> |Company Name |People
> --- | --- 
> |companyNameInMyanmar |officers, landOwner, nameOfInvestor
3. Project bipartite graph, view statistics
4. Trim edges, weighted by number of names per company
5. Visualize

In [1]:
import os
import scipy
from pathlib import *
import json
import pandas as pd
import networkx as nx
from networkx.algorithms import bipartite
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import re
import string
from collections import defaultdict
from itertools import combinations
from pyvis.network import Network

# Company incorporation documents
com_dir = Path('/home/s/fpa/data/company_info')
# Investment proposals and information about real projects
inv_dir = Path('/home/s/fpa/data/investment_info')

# Helper Functions
***
#### pathToList()
- Takes a Path object for a directory full of JSON files
- Returns a list containing a dict for each file read in

#### companyInfo()
- Takes a list of dicts
- Extracts copmany name and names of officers
- Returns a DataFrame with this information

#### investmentInfo()
- Takes a list of lists of dicts
- Converts dicts to DataFrames
- Returns list of DataFrames

#### cleanDF()
- Takes a DataFrame
- Converts all letters to lowercase
- Substitutes ltd with limited
- Removes all punctuation

#### defineEdges()
- Takes company information and a list of other DataFrames
- Creates dictionary with key company name and value list of people
- Adds investors to edge list
- Returns edge list dictionary

#### expandEdges()
- Takes a dictionary of company nodes with lists of connected people
- Returns a list of edges with only two nodes on each line

#### trimEdges()
- Takes a networkX graph object and an integer
- Trims away the connections weighted less than the integer
- Returns a new networkX graph object

#### islandMethod()
- Takes a networkX graph object and an integer
- Iterates trimEdges() on the graph for predefined number of trials
- Returns a list of graphs with some metadata

In [2]:
def pathToList(path_obj):
    file_list = []
    for file_path in path_obj.iterdir():
        data = json.loads(file_path.read_bytes())
        file_list.append(data)
    return(file_list)

def companyInfo(com_list):
    info_list = []
    # Iterate over all companies in the list
    for i in range(len(com_list)):
        info = {}
        # Extract Company Name to "companyNameInMyanmar"
        info['companyNameInMyanmar'] = com_list[i]['Corp']['CompanyName']
        # Extract Officer names
        for b in range(len(com_list[i]['Officers'])):
            info['officer' + str(b)] = com_list[i]['Officers'][b]['FullNameNormalized']
        # Convert to DataFrame
        info_list.append(info)
    df = pd.DataFrame.from_dict(info_list, orient = 'columns')
    return(df)

def investmentInfo(inv_list):
    investments = []
    # Extract data from investment documents
    for i in range(len(inv_list)):
        d = pd.DataFrame.from_dict(
            [x for x in inv_list[i]['data']], orient = 'columns')
        investments.append(d)
    return(investments)

def cleanDF(df):
    # Convert to lowercase
    df = df.applymap(str)
    df = df.applymap(lambda s:s.lower())
    # Remove some useless words
    useless = set(('ltd', 'limited', 'company',))
    for word in useless:
        df = df.applymap(lambda s:s.replace(word, ''))
    # Remove punctuation
    df = df.applymap(lambda s:s.translate(str.maketrans('', '', string.punctuation)))
    # Strip leading and trailing whitespace
    df = df.applymap(lambda s:s.strip())
    return(df)

def defineEdges(companies, investments):
    edges = {}
    # add all companies to edge list
    for i in range(len(companies)):
        co = companies.iloc[i,0]
        edges[co] = []

        # add officers as connections
        for j in range(1,68):
            officer = companies.iloc[i,j]
            if officer != 'nan':
                edges[co].append(officer)
            else:
                break
    for doc in investments:
        doc = cleanDF(doc)
        for i in range(len(doc)):
            co = doc.iloc[i]['companyNameInMyanmar']
            inv = doc.iloc[i]['nameOfInvestor']
            # add remaining companies to edge list
            if co not in edges.keys():
                edges[co] = []
            # add investors as connections
            if inv not in edges[co]:
                edges[co].append(str(inv).replace(' ', ''))
    return(edges)

def expandEdges(edges):
    expanded_edges = []
    for company in edges:
        for person in company:
            edge = (company, person)
            expanded_edges.append(edge)
    return(expanded_edges)

def trimEdges(G, weight = 1):
    G2 = nx.Graph()
    edata = nx.to_pandas_edgelist(G)
    f = edata['source']
    to = edata['target']
    
    for f, to, edata in G.edges(data = True):
        if edata['weight'] > weight:
            G2.add_edge(f, to)
    return G2

def islandMethod(g, iterations = 17):
    weights = [edata['weight'] for f, to, edata in g.edges(data = True)]
    mn = 1
    mx = iterations
    step = 1
    return([[threshold, trimEdges(g, threshold)]
           for threshold in range(mn, mx, step)])

# Preprocessing
***

Below, we load the data, clean it up, and generate an edge list to be used in our network graph.

In [3]:
companies = cleanDF(companyInfo(pathToList(com_dir)))
# Drop rows with no names
companies = companies[companies.officer0 != 'nan']

investments = investmentInfo(pathToList(inv_dir))
edges = defineEdges(companies, investments)
expanded_edges = expandEdges(edges)

reversed_edges = defaultdict(list)
for key, value in edges.items():
    for item in value:
        reversed_edges[item].append(key)

# Network Analysis
***

We create a graph and project a new bipartite graph connecting companies to companies through the people nodes.  Repeated connections are given as greater weights.  We can then trim the edges of the graph progressively, showing which companies have the strongest connections to the other members of the network.

In [4]:
G = nx.Graph()

company_nodes = edges.keys()
people_nodes = reversed_edges.keys()

G.add_nodes_from(company_nodes, bipartite = 'Companies')
G.add_nodes_from(people_nodes, bipartite = 'People')
G.add_edges_from(expanded_edges)

P = bipartite.weighted_projected_graph(G, people_nodes)

print('Total Nodes:  ', G.number_of_nodes(),
      '\n',
      'Total Edges:  ', G.number_of_edges(),
      '\n',
      'Bipartite Edges:  ', P.number_of_edges())

Total Nodes:   217824 
 Total Edges:   1029353 
 Bipartite Edges:   4000128


In [5]:
islands_P = islandMethod(P)

for i in islands_P:
    print('Threshold: ' + str(i[0]))
    print('Number of Nodes: ' + str(len(i[1])))
    print('Length of Connected Components')
    display(pd.DataFrame(pd.Series([len(c) for c in sorted(nx.connected_components(i[1]), key=len)]).value_counts().sort_index()).T)
    print()

Threshold: 1
Number of Nodes: 92878
Length of Connected Components


Unnamed: 0,96,92782
0,1,1



Threshold: 2
Number of Nodes: 91320
Length of Connected Components


Unnamed: 0,90,91230
0,1,1



Threshold: 3
Number of Nodes: 87349
Length of Connected Components


Unnamed: 0,86,87263
0,1,1



Threshold: 4
Number of Nodes: 76616
Length of Connected Components


Unnamed: 0,84,76532
0,1,1



Threshold: 5
Number of Nodes: 55980
Length of Connected Components


Unnamed: 0,82,55898
0,1,1



Threshold: 6
Number of Nodes: 29292
Length of Connected Components


Unnamed: 0,82,29210
0,1,1



Threshold: 7
Number of Nodes: 8888
Length of Connected Components


Unnamed: 0,79,8809
0,1,1



Threshold: 8
Number of Nodes: 1325
Length of Connected Components


Unnamed: 0,79,361,885
0,1,1,1



Threshold: 9
Number of Nodes: 439
Length of Connected Components


Unnamed: 0,78,361
0,1,1



Threshold: 10
Number of Nodes: 438
Length of Connected Components


Unnamed: 0,77,361
0,1,1



Threshold: 11
Number of Nodes: 437
Length of Connected Components


Unnamed: 0,77,360
0,1,1



Threshold: 12
Number of Nodes: 427
Length of Connected Components


Unnamed: 0,76,351
0,1,1



Threshold: 13
Number of Nodes: 417
Length of Connected Components


Unnamed: 0,76,341
0,1,1



Threshold: 14
Number of Nodes: 381
Length of Connected Components


Unnamed: 0,2,76,303
0,1,1,1



Threshold: 15
Number of Nodes: 324
Length of Connected Components


Unnamed: 0,2,76,246
0,1,1,1



Threshold: 16
Number of Nodes: 251
Length of Connected Components


Unnamed: 0,76,175
0,1,1





Before visualizing the graph, we can look at the summary statistics to get an idea of how these companies rank.

In [6]:
# Choose an island and remove garbage names
island = islands_P[15][1]
single_char_names = list(island.nodes())[:79]
island.remove_nodes_from(single_char_names)

# Degree, Eigenvector Centrality and Betweenness are weighted by the total
# number of shared contacts.  DEGREE_CENTRALITY and CLOSENESS are not weighted
summaries = pd.DataFrame(dict(
    degree = dict(island.degree(weight = 'weight')),
    degree_centrality = nx.degree_centrality(island),
    eigenvector = nx.eigenvector_centrality_numpy(island, weight = 'weight'),
    closeness = nx.closeness_centrality(island),
    betweenness = nx.betweenness_centrality(island, weight = 'weight')))

summaries = summaries.sort_values(by = 'degree', ascending = False)
summaries[:10]

Unnamed: 0,degree,degree_centrality,eigenvector,closeness,betweenness
မြန်မာမင်းဒွန်းဂါးမင့်မင်နျူဖက်ချားရင်းကုမ္ပဏီလီမိတက်,122,0.71345,0.43808,0.745434,0.383041
ထရိုင်းဒင့်မြန်မာဂျင်နရယ်ထရိတ်ဒင်းကုမ္ပဏီလီမိတက်,117,0.684211,0.43496,0.728178,0.310449
အရှေ့တိုင်းအထည်ချုပ်ကုမ္ပဏီလီမိတက်,102,0.596491,0.382035,0.680894,0.28989
စူပါဂါးမင့်န်ကုမ္ပဏီလီမိတက်,10,0.05848,0.056768,0.433296,0.001183
အုန်းမိသားစုကုမ္ပဏီလီမိတက်,9,0.052632,0.021807,0.329741,0.033758
ခရစ်စတယ်လ်ဂါးမင့်အင်န်နစ်တင်းအင်ဒက်စ်တရီကုမ္ပဏီလီမိတက် krystal garment knitting industry co,5,0.02924,0.074167,0.472332,0.011386
ရတနာနစ်တ်အင်န်ဂါးမန့်တ်မန်နူဖက်ချားရင်းကုမ္ပဏီလီမိတက် yandana knit and garment manufacturing,5,0.02924,0.074167,0.472332,0.011386
ငွေပင်လယ်လှိုင်းတွန့်စက္ကူဘူးထုတ်လုပ်ရေးကုမ္ပဏီလီမိတက်silver sea paper carton box production co,4,0.023392,0.073697,0.466726,0.00031
ထွန်းပဒေသာအင်ဒတ်စတြီးစ်အင်န် ထရိတ်ဒင်းကုမ္ပဏီလီမိတက်,4,0.023392,0.071733,0.476626,0.007964
မြန်မာမင်းဒွန်းဂါးမင့်မင်နျူဖက်ချားရင်း ကုမ္ပဏီလီမိတက်,4,0.023392,0.073697,0.466726,0.00031


In [8]:
net = Network(notebook = True)
net.from_nx(island)
net.width = '2000px'
net.height = '1000px'
net.repulsion(node_distance=200, central_gravity=0.01,
              spring_length=300, spring_strength=0.01,
              damping=0.09)
net.save_graph("Highly-Connected_Companies.html")

# Conclusions
---

We have managed to clean up an extremely large amount of very messy data from 7 separate sources.  We have created a graph that connects a hundred thousand companies by their registered officers, investors, and landowners.  In the end, we are able to find, through basic network analysis methods, the most connected companies, who are very likely drawing the greatest amount of foreign investment to Myanmar.

We can draw no firm conclusions about how connected these companies are to the junta.  However, this is a fantastic place to begin further research.