# DATA620 Final Project A
## Network Analysis

# Project Description
***

We would like to find businesses (or ideally people) who are influential in drawing foreign and domestic investment in Myanmar.  Through basic network analysis techniques, we hope to find a clique of people or companies with common investment sources or common activities.

# Data Description
***

This data was scraped and realeased anonymously from official government sources in two leaks called Myanmar Financials and Myanmar Investments.  The former is incorporation documents for ~125k companies, and the latter is information from investment proposals for about 10k companies.

# Known Challenges
***

1. About 1/4 of the companies in Myanmar Financials paid somebody to approve their incorporation documents without addresses or names... We can do nothing about this.

2. People in Myanmar are named for astrological information pertaining to their birth.  There is no family name, and many people have the same names.  We also cannot do anything about this.

3. A small number of names are given in Burmese script, which is included in UTF-8, but is unreadable to our team.  We can leave the script as-is, and remove punctuation.

# Method
***

1. Import and clean the data
2. Create an edge list:
> |Company Name |People
> --- | --- 
> |companyNameInMyanmar |officers, landOwner, nameOfInvestor
3. Project bipartite graph, view statistics
4. Trim edges, weighted by number of names per company
5. Visualize

In [1]:
import os
from pathlib import *
import json
import pandas as pd
import networkx as nx
import matplotlib as plt
from bs4 import BeautifulSoup
import re
import string

In [2]:
# Company incorporation documents
com_dir = Path('/home/s/fpa/data/company_info')

# Investment proposals and information about real projects
inv_dir = Path('/home/s/fpa/data/investment_info')

# Helper Functions
***
#### pathToList()
- Takes a Path object for a directory full of JSON files
- Returns a list containing a dict for each file read in

#### companyInfo()
- Takes a list of dicts
- Extracts copmany name and names of officers
- Returns a DataFrame with this information

#### investmentInfo()
- Takes a list of lists of dicts
- Converts dicts to DataFrames
- Returns list of DataFrames

#### cleanDF()
- Takes a DataFrame
- Converts all letters to lowercase
- Substitutes ltd with limited
- Removes all punctuation

#### defineEdges()
- Takes company information and a list of other DataFrames
- Creates dictionary with key company name and value list of people
- Adds investors to edge list
- Returns edge list dictionary

In [3]:
def pathToList(path_obj):
    file_list = []
    for file_path in path_obj.iterdir():
        data = json.loads(file_path.read_bytes())
        file_list.append(data)
    return(file_list)

def companyInfo(com_list):
    info_list = []
    # Iterate over all companies in the list
    for i in range(len(com_list)):
        info = {}
        # Extract Company Name to "companyNameInMyanmar"
        info['companyNameInMyanmar'] = com_list[i]['Corp']['CompanyName']
        # Extract Officer names
        for b in range(len(com_list[i]['Officers'])):
            info['officer' + str(b)] = com_list[i]['Officers'][b]['FullNameNormalized']
        # Convert to DataFrame
        info_list.append(info)
    df = pd.DataFrame.from_dict(info_list, orient = 'columns')
    return(df)

def investmentInfo(inv_list):
    investments = []
    # Extract data from investment documents
    for i in range(len(inv_list)):
        d = pd.DataFrame.from_dict(
            [x for x in inv_list[i]['data']], orient = 'columns')
        investments.append(d)
    return(investments)

def cleanDF(df):
    # Convert to lowercase
    df = df.applymap(str)
    df = df.applymap(lambda s:s.lower())
    # ltd -> limited
    df = df.applymap(lambda s:s.replace('ltd', 'limited'))
    # Remove punctuation
    df = df.applymap(lambda s:s.translate(str.maketrans('', '', string.punctuation)))         
    # Strip trailing spaces
    df = df.applymap(lambda s:s.strip)
    return(df)

def defineEdges(companies, investments):
    edges = {}
    # add all companies to edge list
    for i in range(len(companies)):
        co = companies.iloc[i,0]
        edges[co] = []

        # add officers as connections
        for j in range(1,68):
            officer = companies.iloc[i,j]
            if officer != 'nan':
                co.append(officer)
            else:
                break
                
    for doc in investments:
        doc = cleanDF(doc)
        for i in range(len(doc)):
            co = doc.iloc[i]['companyNameInMyanmar']
            inv = doc.iloc[i]['nameOfInvestor']

            # add remaining companies to edge list
            if co not in edges.keys():
                edges[co] = []

            # add investors as connections
            if inv not in edges[co]:
                edges[co].append(str(inv).replace(' ', ''))
    return(edges)

#com_with_info = companies.dropna(subset = ['address0', 'officer0name'], how = 'all')

# Preprocessing
***

In [4]:
companies = cleanDF(companyInfo(pathToList(com_dir)))
investments = investmentInfo(pathToList(inv_dir))

In [5]:
defineEdges(companies, investments)

AttributeError: 'builtin_function_or_method' object has no attribute 'append'

# Network Analysis
***

# Conclusions
***