** DTU - 02805 Social graphs and interactions (2016) **

# Project Assignment B
## Company Network

A [video description](https://www.youtube.com/watch?v=HNrX5dCCrNw) gives a basic overview of the project. 


The main motivation was to get insights about the network of companies. Our dataset is extracted from articles about companies on [Wikipedia](https://www.wikipedia.org/). There is some coverage in part _1.2._ explaining how other data sources were incorporated to gather extra information that includes the type, industry, location and much more of a company.

This particular dataset was chosen to build a network of the companies. Links between each company that represents a node are made according to the links in the Wikipedia article. Running several network analysis the goal is to find interesting facts like communities that are not only formed by type, industry or country. Maybe certain parent companies with their subsidiaries are found. Can it be that there are a few super companies that run the world? We hope to find some answers to this question.

Furthermore, a sentiment analysis will help to gain insights into the general mood on companies on [Twitter](https://twitter.com/?lang=en). Who is the most hated company in Europe at the moment and does it play a very central row overall? Is Volkswagen still hit by the emission scandal or Samsung by the Galaxy Note 7 disaster? Is Twitter maybe dominated by the happy commercials of the companies?

Following a general outline of the explainer notebook covering [required deliverables](https://github.com/suneman/socialgraphs2016/wiki/Assignments#more-on-the-explainer-notebook) surrounding basic stats, tools, theory and analysis:

* 1\. Data Extraction 
    * 1.1\. API Data Gathering
    * 1.2\. Additional Extraction
    * 1.3\. Data Merging
    * 1.4\. Adding Geo Location
    * 1.5\. Data Cleaning   
* 2\. Preliminary Data Analysis
    * 2.2\. Company Analysis
    * 2.2\. Company Statistics
    * 2.3\. Company Graphs
* 3\. Network Construction
    * 3.1\. Alternative Construction
* 4\. Network Analysis
    * 4.1\. Degree Distribution
    * 4.2\. Power-laws and Friendship Paradox
    * 4.3\. Centrality
    * 4.4\. Assortativity
    * 4.5\. Modularity
    * 4.6\. Communities
    * 4.7\. Network Visualizations and Statistics
* 5\. Sentiment Analysis of Twitter Data
    * 5.1\. API Data Gathering
    * 5.2\. Happiness Averages of Companies
    * 5.3\. Wordclouds
* 6\. Discussion

## 0. Load Modules

Each needed module is loaded in order to keep the overview of already loaded modules. Also certain files and directories where data can be loaded from are specified. The reason is many that data crawling and some network analysis can take some time.

In [1]:
# IPython global cell magic
%reset
%matplotlib inline

# import all necessary packages
import bs4 # HTML parser
from collections import Counter, OrderedDict # counting elements and ordering keys in dictionaries
import community # python-louvain package
import datetime # handle date objects
import dateparser # parse any (also foreign) date format to object: https://pypi.python.org/pypi/dateparser
from __future__ import division # all numbers are float
import gc # garbage collector
import geoplotlib # plot points on tiled maps
from geoplotlib.utils import BoundingBox
import geopy # get geo location according to addresses
from geopy.exc import GeocoderServiceError
from infomap import infomap # python infomap algorithm, needs to be in same directory
import itertools # iterators for efficient looping
import json # JSON parser
import math # math operations
from matplotlib import pyplot as plt # plotting figures
import mwparserfromhell # parse MediaWiki syntax: https://github.com/earwig/mwparserfromhell
from nameparser import HumanName # parse a human name
import networkx as nx # networks creation library
import nltk # natural language processing
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import operator # efficient operator functions
import os # operating system operations, e.g.: with files and folders
import pandas as pd # use easy-to-use data frames for data analysis
import pickle # python data structures as files
from pprint import pprint # print data structures prettier
import re # regex
import requests # request URL content
import sys # system operations
import time # sleep timer
from tqdm import tqdm_notebook # make a nice progressbar
import urllib # handle special URL chars

# make working directory
directory = os.getcwd() + '/companies'
if not os.path.exists(directory):
    os.makedirs(directory)

# files from data crawling
ex1_fdat = directory + '/extraction1_data.pkl'
ex2_fdat = directory + '/extraction2_data.pkl'
ex3_tmp_fdat = directory + '/tmp_extraction3_data.pkl'
ex3_fdat = directory + '/extraction3_data.pkl'
merged = directory + '/merged_data.pkl'
# network files
network_f = directory + '/network.pkl'
network_red_f = directory + '/reduced_network.pkl'
gephi_f = directory + 'gehpi.gexf'
# pandas data
extraction_csv = directory + '/company_data.csv'

# specify nltk data dir, otherwise LookupError
nltk.data.path.append(os.getcwd() + '/../nltk_data')
from nltk.corpus import names

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## 1. Data Extraction
### 1.1. API Data Gathering

The first approach to detect companies on Wikipedia is by looking for a template called ["Infobox company"](https://en.wikipedia.org/wiki/Template:Infobox_company) that is embedded on the site. It contains certain fields that are also used as extra data that can be used for some data analysis later on, e.g.:
* `key_people` - CEO and other important postitions in the company.
* `founded` - The date when the company was founded.
* `hq_location` - The address of the company.
* etc.

Example query for extraction of pages with "Infobox company" template:

* https://en.wikipedia.org/w/api.php?&action=query&list=embeddedin&eititle=Template:Infobox+company&einamespace=0&eilimit=max
* https://en.wikipedia.org/w/api.php?&action=query&list=embeddedin&eititle=Template:Infobox+company&einamespace=0&eilimit=max&eicontinue=0|167638

Each query returns a maximum of 500 Wikipedia pages. Over a continue parameter the next results can be queried. That returned data is in JSON format and thus can be easily parsed into a Python dictionary.

Some functions are defined and used throughout the notebook. Most of the functions have a description.

In [2]:
def get_json_from_url(url):
    """URL request that returns JSOn data if parsable."""

    r = requests.get(url)
    
    # on HTML error codes
    if r.status_code != 200:
        return None

    # try converting into JSON
    try:
        sec = r.json()
        return sec
    except ValueError:  # includes simplejson.decoder.JSONDecodeError
        print 'WARN: Decoding JSON has failed on:', url
    return None

In [3]:
def get_comp_pages(api_url, extension=None):
    """Query to get company wikipedia pages with embedded template."""
    
    # extend the input URL with continue parameter
    ext_api_url = api_url
    if extension:
        ext_api_url += extension
    
    # retrieve JSON
    sites_w_template = get_json_from_url(ext_api_url)

    # check for valid data
    if not sites_w_template or 'query' not in sites_w_template:
        print "WARN: query returned no data:", infobox_url + properties
    else:
        # iterate over list of companies that embed infobox and save to dict
        for page in sites_w_template['query']['embeddedin']:
            # unicode to str
            quoted_name = urllib.quote_plus(page['title'].encode('utf-8'))
            companies.update({
                page['title']: {
                    'wiki_page_id': int(page['pageid']),
                    'wiki_name': unicode(page['title']),
                    'wiki_url': u'{0}{1}?title={2}'.format(wiki_base, wiki_index, quoted_name),
                    'name_url_quoted': unicode(quoted_name)
                }
            })

        # run the function recursively, when company found
        if 'continue' in sites_w_template:
            extension = '&eicontinue=' + sites_w_template['continue']['eicontinue']
            return get_comp_pages(api_url, extension)
    
    return companies

All initial data will be stored in a dictonary. First the data necessary to crawl each company Wikipedia page individually is gathered.

In [4]:
# specify wikipedia API URL parameters
wiki_base = u'https://en.wikipedia.org'
wiki_api = u'/w/api.php'
wiki_index = u'/w/index.php'
action = u'action=query'
dat_format = u'format=json'
# eititle=Template:Infobox+company - embedded pages with Template {{Infobox Company}}
# einamespace=0 only articles
# eilimit=max - maximum number before continuing to minimize requests
properties = u'list=embeddedin&eititle=Template:Infobox+company&einamespace=0&eilimit=max'

# concatenate URL
infobox_url = '{0}{1}?&{2}&{3}&'.format(
    wiki_base, wiki_api, action, dat_format)

# init and update with recursive function
companies = dict()
get_comp_pages(infobox_url + properties)

print len(companies), "companies were extracted."

56893 companies were extracted.


A small excerpt of gathered information.

In [5]:
company = 'AT&T'
print "Dictonary excerpt of", company, "after searching for Infobox company template:"
print companies[company]

Dictonary excerpt of AT&T after searching for Infobox company template:
{'name_url_quoted': u'AT%26T', 'wiki_page_id': 17555269, 'wiki_url': u'https://en.wikipedia.org/w/index.php?title=AT%26T', 'wiki_name': u'AT&T'}


A lot of cleaning and proper parsing of the each individual field in the template has to be done. Since everyone is free to edit any wikipedia page there are certain problems on many pages. Individual cases are handled in the next functions. The module [mwparserfromhell](https://github.com/earwig/mwparserfromhell) is used to parse so called templates in the extracted wikitext.

Here are some other modules that are used:
* [dateparser](https://pypi.python.org/pypi/dateparser) for `defunct` and `founded` date fields
* [nameparser](https://pypi.python.org/pypi/nameparser) for `key_people` field

Also notice that even with this method there are problematic non-company subjects like: [Logo of the BBC](https://en.wikipedia.org/wiki/Logo_of_the_BBC) but then usually the name in infobox fits: British Broadcasting Corporation. However, it would be more problematic to create a link list according to the `name` in the company infobox. Multiple companies with the same name will be addressed later.

In [5]:
def get_temp_val(temp, k):
    """Check if the parameter exists in the infobox template"""

    try:
        param = temp.get(k)
        return param
    except ValueError:
        # try alternative keys (template not always consistent)
        if k == 'founded':
            return get_temp_val(temp, 'foundation')
        if k == 'founders':
            return get_temp_val(temp, 'founder')
        if 'location' in k and 'hq_' not in k:
            return get_temp_val(temp, 'hq_' + k)
    # key not found in template
    return None

In [6]:
def parse_employee_nr(input_val, c_name):
    """
    Parse a proper number out of the various inputs
    Case possibilities tested:
    
    s_list = [
        '15-50',
        '~300+',
        'circa 40',
        '9,985(Dec 2011)',
        'over 10,000 in 10 countries',
        'Five',
        'over 1 million',
        '10.000',
        'Part of Popular, Inc., which has 8,000 employees'
    ]

    for s in s_list:
        print parse_employee_nr(s, 'test')
    """
    
    # match the first number, dot or comma separation optional
    m = re.search(r'[0-9]+([,\.][0-9]+)?', unicode(input_val))
    if m:
        try:
            # replace , and conert to int
            return int(m.group().replace(',', '').replace('.', ''))
        except ValueError:
            print "WARN: Failed conversion of:{0} (company: {1})".format(
                input_val, c_name)
    return None

In [48]:
def parse_people_name(names):
    """
    Tim Cook (CEO) becomes 
    {u'last': u'Cook', u'suffix': u'', u'title': u'', u'middle': u'', u'nickname': u'CEO', u'first': u'Tim'}
    """

    if isinstance(names, list):
        # if n makes sure no empty list value
        return [HumanName(unicode(n)).as_dict() for n in names if n]
    elif names:
        # parse the single name and return as list
        return [HumanName(unicode(names)).as_dict()]
    return names

In [8]:
def parse_date(date):
    """
    Parse a date, also handles complicated cases and other languages like:
    Martes 21 de Octubre de 201 -> datetime.datetime(2014, 10, 21, 0, 0) [Spanish (Tuesday 21 October 2014)]
    """

    if date: 
        # if only the year is given returns todays date in that year
        date_parsed = dateparser.parse(date, settings={
            'PREFER_DATES_FROM': 'past',
            'DATE_ORDER': 'YMD'})
        # try to extract the first number with 4 digits from 18XX-20XX
        if not date_parsed:
            m = re.search(r'(18|19|20\d{2})', date)
            if m:
                return dateparser.parse(m.group())
        else:
            return date_parsed
    # return null time: 00:00:00
    return datetime.time()

In [9]:
def parse_wiki_raw(param, k, c_name):
    """Extract the information on the raw value without notable wiki markup"""

    # strip_code does not work properly for [[File:: https://github.com/earwig/mwparserfromhell/issues/136
    if k == 'logo':
        split_val = unicode(param.value).strip().split('|')
        # first value always image link
        if split_val[0]:
            # avoid bs4 warning
            val = split_val[0]
            if not re.match(r'http://.*', val):
                val = bs4.BeautifulSoup(val, 'lxml').text
            val = val.replace('[[', '').replace(' ', '_')
            if val:
                # can be written without [[File: or [[Image:
                if 'File:' in val or 'Image:' in val:
                    pass
                else:
                    val = 'File:' + val
                # submit raw version too, also links can differ
                return {
                    'wiki_commons_link': 'https://commons.wikimedia.org/wiki/' + val,
                    'wiki_file_link': wiki_base + '/wiki/' + val,
                    'wiki_raw_code': param.value.strip()}
    
    # fields can contain break separations, e.g. Microsoft: 
    # [[John W. Thompson]] <small> ([[Chairman]]) </small> <br /> [[Satya Nadella]]
    if '<br' in param.value:
        # replace the HTML breaks with real newline
        param.value = re.sub(r'<br>|<br ?/>', '\n', unicode(param.value))
        # get rid of media wiki markup and split into parts
        val = unicode(param.value.strip_code())
        # strip code does not always remove HTML tags
        val = bs4.BeautifulSoup(val, 'lxml').text
        items = val.split('\n')
        if k == 'key_people' or k == 'founders':
            return parse_people_name(items)
        if k == 'num_employees':
            return parse_employee_nr(items[-1], c_name)
        return items

    # only assuming one value when no breaks, too vague to use other seperators than <br>
    val = param.value.strip_code()

    # avoid bs4 warning when parsing HTML link
    if k != 'homepage':
        # properly remove HTML remains
        val = bs4.BeautifulSoup(val, 'lxml').text
    if k == 'homepage':
        # there are cases like: [http://www.absn.tv/ ABS] for homepage/website that don't match URL template
        match_link = re.search(r'\[(.+)\]', unicode(param.value))
        if match_link:
            val = match_link.group(1)
        # if whitespaces, then still no valid link
        if ' ' in val:
            # HTML needs to be ignored, e.g.: <!-- {{URL|www.example.com}} --> (also strips text)
            rem_html = bs4.BeautifulSoup(val, 'lxml').text
            if rem_html:
                return param.value.strip().split(' ')[0].replace('[', '')
            else:
                return None
    if k == 'num_employees':
        return parse_employee_nr(val, c_name)

    # make sure string didn't just contain whitespaces
    if val:
        return unicode(val).strip()
    # it can be that the field exists but is empty
    return None

In [10]:
def parse_wiki_template(param, k, c_name):
    """Extract further information if templates in key value"""

    for tem in param.value.filter_templates():
        # handle dates, URL and list
        if k == 'founded' or k == 'defunct':
            if tem.name.matches(('start date and age', 'start date', 'end date')):
                # valid date: {{Start date and age|2003|January|5|df=1}}
                date = [unicode(p) for p in tem.params if '=' not in unicode(p)]
                # concatenate the date and give to dateparser
                if not date:
                    return datetime.time()
                return parse_date("/".join(date))
            else:
                return datetime.time()
        elif k == 'homepage' and tem.name.matches(('url', 'URL')):
            if tem.params:
                try:
                    url = tem.get(1)
                except ValueError:
                    print "WARN: Could not get first element in params:", tem.params
                    return unicode(tem.params[0]).strip()
                return unicode(url).strip()
        elif tem.name.matches('unbulleted list'):
            # replace wiki link markup and remove extra citations etc.
            items = [unicode(p.value.strip_code().strip()) for p in tem.params]
            if k == 'key_people' or k == 'founders':
                return parse_people_name(items)
            if k == 'num_employees':
                # e.g.: {{unbulleted list|{{loss}}0 (7 February 2013)|850 (6 February 2013)}}
                # usually last value contains current number
                return parse_employee_nr(items[-1], c_name)
            return items
        elif tem.name.matches(('plainlist', 'flatlist')):
            if tem.params:
                items = unicode(tem.params[0].value.strip_code().strip()).split('\n')
                if k == 'key_people' or k == 'founders':
                    # use nameparse to exclude titles, etc.
                    return parse_people_name(items)
                return items
        elif 'formatnum' in tem.name and k == 'num_employees':
            # {{formatnum:1234}}, colon not detected by mwparser
            m = re.search(r'formatnum:(.*)', unicode(tem.name))
            if m:
                return parse_employee_nr(m.group(1), c_name)
        elif 'nowrap' in tem.name or 'nihongo' in tem.name or 'small' in tem.name:
            if k == 'name':
                return unicode(tem.params)

    # if the template didn't match anything parse raw text
    return parse_wiki_raw(param, k, c_name)

In [11]:
def parse_wiki_text(company):
    """Parse MediaWiki markup to extract detailed information"""

    # remove ref tags, bs4 adds unwanted body tags, thus regex better
    wiki_raw = re.sub(r'<ref.+?</ref>|<ref>.+?</ref>', '', company['wiki_raw'])
    
    # sometimes name can not be attached when parsing
    company['name'] = company['wiki_name']

    comp_infobox = dict()
    # parse the wikimedia syntax
    # style can be ignored: https://github.com/earwig/mwparserfromhell/issues/115
    code = mwparserfromhell.parse(wiki_raw, skip_style_tags=True)
    # filter for the infobox
    # If matches is a regex, the flags passed to re.search() are re.IGNORECASE ...
    c_template = code.filter_templates(
        matches=r'infobox company|infobox_company|company infobox|infobox dot-com company')
    if not c_template:
        # try regex approach for any infobox, if nothing found above
        match = re.search(r'({{infobox.*\n(?:\|.*\n|\*.*\n)+}})', wiki_raw, re.IGNORECASE)
        if match:
            infobox_temp = mwparserfromhell.parse(
                match.group(1), skip_style_tags=True).filter_templates()
            if infobox_temp:
                infobox_temp = infobox_temp[0]
            else:
                print "WARN: No parsable company infobox template for:", company['wiki_name']
                return comp_infobox
        else:
            print "WARN: No company infobox found for:", company['wiki_name']
            return comp_infobox
    else:
        infobox_temp = c_template[0]

    # find values for each key
    key_list = ['name', 'logo', 'type', 'founders', 'key_people', 'industry', 'founded', 'location', \
                'location_city', 'location_country', 'defunct', 'subsid', \
                'products', 'num_employees', 'parent', 'homepage']
    for k in key_list:
        param = get_temp_val(infobox_temp, k)
        if param:
            val = parse_wiki_template(param, k, company['wiki_name'])
            if val:
                comp_infobox[k] = val

    return comp_infobox

In [13]:
def get_c_info(company, repeat=None):
    """Get links and the wikitext which is used to extract key information about the company"""

    # reparse old text if text and links already there
    if repeat or 'wiki_raw' not in company or 'all_links' not in company:   
        # concatenate URL from previously set values
        link_url = '{0}{1}?&{2}&{3}&{4}&pageid={5}'.format(
            wiki_base, wiki_api, action, properties, dat_format, company['wiki_page_id'])

        # get JSON and check for validity
        c_content = get_json_from_url(link_url)
        if not c_content or 'parse' not in c_content:
            print 'WARN: No parsable content on: {0} under page {1} (id: {2})'.format(
                link_url, company['wiki_name'], company['wiki_page_id'])
            company['is_company'] = False
            return company

        # save new fields
        company['wiki_api_url'] = unicode(link_url)
        company['all_links'] = [x['*'] for x in c_content['parse']['links'] if x['ns'] == 0]

        # original wikitext to parse {{Infobox company
        company['wiki_raw'] = c_content['parse']['wikitext']['*']
    
    # extract company info from box, only on wiki text not HTML
    company_infobox = parse_wiki_text(company)
    company.update(company_infobox)
    company['is_company'] = True
    
    # keep non-company nodes but mark
    if not company_infobox:
        company['is_company'] = False
    return company

Here is an excerpt of the company infobox of the Google page:
```
{{Infobox company
| name          = Google Inc.
| logo          = Google 2015 logo.svg
| logo_size     = 225
| logo_alt = The letters of "Google" are each purely colored (from left to right) with blue, red, yellow, blue, green, and red.
| logo_caption  =
| image         = Googleplex-Patio-Aug-2014.JPG
| image_size    = 250
| image_caption = The [[Googleplex]] headquarters in 2014
| type          = [[Subsidiary]]
| area_served      = Worldwide
| key_people       = [[Sundar Pichai]] ([[CEO]])
| industry      = {{plainlist|
* [[Internet]]
* [[Software|Computer software]]
* [[Computer hardware]]
}}
| products         = [[List of Google products]]
| num_employees    = 57,100 (Q2 2015)<ref>{{cite web|url=http://www.businessinsider.com/google-has-57000-employees-2015-7|title=Google's hiring may have slowed, but it's still adding thousands of new employees|publisher=[[Business Insider]]}}</ref>
| parent           = [[Alphabet Inc.]]<br><small>(2015–present)</small>
| subsid           = [[List of mergers and acquisitions by Alphabet|List of subsidiaries]]
| footnotes        = <ref>{{cite web|url=http://investor.google.com/proxy.html|title=Google Inc. Annual Reports |date=July 28, 2014|publisher=Google Inc.|accessdate=August 29, 2014}}</ref>
| homepage         = {{URL|https://www.google.com}}
| foundation       = {{start date and age|1998|09|4}}<!-- Do not change this to September 27: every year they celebrate at a different date, but the company was founded on September 4. Also, do not add that it is x years old as Google is, obviously, not a human and therefore the age is not very relevant. --><br />[[Menlo Park, California]]<ref>{{cite web|title=Company|url=https://www.google.com/intl/en/about/company/|publisher=Google|accessdate=January 16, 2015}}</ref><ref>{{cite web|last=Claburn|first=Thomas|title=Google Founded By Sergey Brin, Larry Page... And Hubert Chang?!?|url=http://www.informationweek.com/news/internet/google/210603678|publisher=InformationWeek|accessdate=August 31, 2011}}</ref>
| founders         = {{plainlist|
* [[Larry Page]]
* [[Sergey Brin]]
}}
| location_city    = [[Googleplex]], [[Mountain View, California|Mountain View]], [[California]]
| location_country = U.S.<ref>{{cite web|url=https://www.google.com/about/jobs/locations/ |title=Locations&nbsp;— Google Jobs |publisher=Google.com |accessdate=September 27, 2013}}</ref>
}}
```

And the parsed result:
```
 'founded': datetime.datetime(1998, 9, 4, 0, 0),
 'homepage': u'https://www.google.com',
 'industry': [u'Internet', u' Computer software', u' Computer hardware'],
 'is_company': True,
 'key_people': u'Sundar Pichai (CEO)',
 'location_city': u'Googleplex, Mountain View, California',
 'location_country': u'U.S.',
 'logo': {'wiki_commons_link': u'https://commons.wikimedia.org/wiki/File:Google_2015_logo.svg',
          'wiki_file_link': u'https://en.wikipedia.org/wiki/File:Google_2015_logo.svg',
          'wiki_raw_code': u'Google 2015 logo.svg'},
 'name': u'Google Inc.',
 'name_url_quoted': 'Google',
 'num_employees': 57100,
 'parent': [u'Alphabet Inc.', u'(2015\u2013present)'],
 'products': u'List of Google products',
 'subsid': u'List of subsidiaries',
 'type': u'Subsidiary',
```

Unfortunately not all fields are always detected properly but that issue is adressed in the cleaning part.

For each page links and the wikitext is extracted and saved along with the data fields from the template. Example queries to extract the information from a Wikipedia page:
* https://en.wikipedia.org/w/api.php?action=parse&page=Audi&prop=links|wikitext
* https://en.wikipedia.org/w/api.php?action=parse&page=AT%26T&prop=links|wikitext
* https://en.wikipedia.org/w/api.php?&action=parse&format=json&prop=links|wikitext&pageid=17555269

In [43]:
# change the main URL parameters
action = u'action=parse'
dat_format = u'format=json'
properties = u'prop=links|wikitext'

# other: 'Audi', 'Apple Inc.', 'Microsoft'
company = 'Candover Investments'

print "Excerpt of", company, "dict structure:"
pprint(
    get_c_info(merged_companies[company], repeat=True))

Excerpt of Candover Investments dict structure:
{'all_links': [u'Candover',
               u'3i',
               u'ABN AMRO',
               u'Advent International',
               u'AlpInvest Partners',
               u'American Capital Strategies',
               u'Angel investor',
               u'Apax Partners',
               u'Apollo Global Management',
               u'Ares Management',
               u'Assets under management',
               u'BC Partners',
               u'Bain Capital',
               u'Berkshire Partners',
               u'Bridgepoint Capital',
               u'Business Development Company',
               u'Business incubator',
               u'Buyout',
               u'Buy\u2013sell agreement',
               u'CCMP Capital',
               u'CVC Capital Partners',
               u'Capital call',
               u'Capital commitment',
               u'Capital structure',
               u'Capitalization table',
               u'Carlyle Group',
             

The function `get_c_info()` will now be apllied for each company in a loop.

In [56]:
def scrape_companies(companies, repeat=None):
    """Gets two dictionaries and parses those who don't have same content"""

    # scrape links and extra information of all companies, call with progress-bar
    for company in tqdm_notebook(companies, desc='Companies'):
        # don't repeat if pickle file already contains data
        if not repeat and 'is_company' in companies[company] and \
            companies[company]['is_company']:
                continue
        # extract any company information
        companies[company] = get_c_info(companies[company])
    return companies

A file containing previous scraped data is loaded. All companies that intersect with the previous data are updated to the old data. For complete reparsing that step can be skipped.

In [139]:
# get company file from extraction 1
com_dat_pickle = dict()
if os.path.isfile(ex1_fdat):
    with open(ex1_fdat, 'rb') as f:
        com_dat_pickle = pickle.load(f)
        # update previously fetched data to same keys
        companies.update(com_dat_pickle)

Each query and parsing with regular expressions takes some time for the approximately 57.000 companies. A rough estimate of 5 hours crawling time was made. There are sometimes warning when bare URL strings are parsed with the HTML parser [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) but the warning can be ignored as it is just used to strip HTML tags, sometimes fields contain a bare URL for instance at the `logo` field.

In [19]:
%%time
# 500 companies take about 2-3 mins
# for 64k that is approx. 128 cykles which should total maximum 5 hours crawl time
# ((64000 / 500) * 2) / 60

# crawl for all companies that were not prcoessed yet and cotain no or a false 'is_company' flag
companies = scrape_companies(companies)

WARN: No company infobox found for: McDonald's legal cases
WARN: No company infobox found for: History of Burger King
WARN: No company infobox found for: KFC advertising
WARN: No company infobox found for: Burger King products
WARN: No company infobox found for: Burger King legal issues
WARN: No company infobox found for: History of McDonald's
WARN: No company infobox found for: History of KFC
WARN: No company infobox found for: McDonald's advertising
WARN: No company infobox found for: Burger King advertising
WARN: No company infobox found for: BEHR Group Holdings LLP
WARN: No company infobox found for: Fondo Común
WARN: No company infobox found for: List of Burger King ad programs
WARN: No company infobox found for: List of McDonald's ad programs
WARN: No parsable content on: https://en.wikipedia.org/w/api.php?&action=parse&prop=links|wikitext&format=json&pageid=52437332 under page ONMI audio (id: 52437332)
WARN: No company infobox found for: List of McDonald's products

CPU times: u

In [22]:
# store company data in one binary file
with open(ex1_fdat, 'wb') as f:
    pickle.dump(companies, f)

### 1.2. Additional Extraction

Additional extraction was conducted in order to get more data. However, it is very hard to verify if page links are actual companies. Two alternate data sources where used:

1. [Companies portal](https://en.wikipedia.org/wiki/Portal:Companies) which aims to collect data and interesting facts about companies. They provide a [table](https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Company&limit=1000) with approximately 48.000 companies. Nevertheless, this table also contains [stubs](https://en.wikipedia.org/wiki/Wikipedia:Stub), many companies that actually have the "Infobox company" template and other irrelevant/non-companies.
2. Another approach was to use a [list of companies by country](https://en.wikipedia.org/wiki/Category:Lists_of_companies_by_country). But also this list had some shortcomings with regards to identify what is a company link. although many of the pages have either bullet lists or tables some sections contain non-companies in the enumerations. For example the [list of companies in Ireland](https://en.wikipedia.org/wiki/List_of_companies_of_Ireland) has a section about company types i.e. [private company](https://en.wikipedia.org/wiki/Private_company_limited_by_shares). These kind of subjects have a lot of links but a hard to detect.

The answer to why additional data is even necessary is that some prominent companies like [Yahoo](https://en.wikipedia.org/wiki/Yahoo!) might be missing because it has another used template called "Infobox dot-com company". Obviously it is hard to keep track of all these different cases.

With these shortcomings the idea was to verify a page title over a third party source called [OpenCorporates](https://opencorporates.com/) with over 115M verified companies from official company registers. However, their API has a very strict API restriction of 500 lookups per month. They also had the API possibility to retrieve [network data](https://opencorporates.com/info/networks) about subsidiaries of companies. Unfortunately a more sophisticated API access was denied although 2 mails have been exchanged.

Instead the Copenhagen based company [WikiBusiness](http://wikibusiness.org/) was more interested in helping us realize our idea.

The two approaches of extracting extra data have been documented in two separate notebooks:
* [Data extration part 2](extraction2.ipynb)
* [Data extration part 3](extraction3.ipynb)

### 1.3. Data Merging

In this part all the data will be merged into one dictionary. With this set of companies the links that a company has to all other companies can be collected in a link set.

In [141]:
# delete old objects to save some memory
del companies, com_dat_pickle
gc.collect()

0

In [142]:
# load all pickle files that contain separate company dictionaries
extraction1, extraction2, extraction3 = dict(), dict(), dict()
if os.path.isfile(ex1_fdat):
    with open(ex1_fdat, 'rb') as f:
        extraction1 = pickle.load(f)
if os.path.isfile(ex2_fdat):
    with open(ex2_fdat, 'rb') as f:
        extraction2 = pickle.load(f)
if os.path.isfile(ex3_fdat):
    with open(ex3_fdat, 'rb') as f:
        extraction3 = pickle.load(f)

In [145]:
def create_link_set(extr_companies):
    """If valid company then add to set of all companies."""

    all_c = set()
    for c in extr_companies:
        if 'is_company' in extr_companies[c] and extr_companies[c]['is_company']:
            all_c.add(c)
    return all_c

In [146]:
# create unique set of all companies
ex1_links = create_link_set(extraction1)
ex2_links = create_link_set(extraction2)
ex3_links = create_link_set(extraction3)
# union of sets
all_companies = ex1_links | ex2_links | ex3_links

print "Number of all companies and valid companies from extraction 1:"
print len(extraction1), len(ex1_links)

print "Number of all companies and valid companies from extraction 2:"
print len(extraction2), len(ex2_links)

print "Number of all companies and valid companies from extraction 3:"
print len(extraction3), len(ex3_links)

print "\nTotal of valid companies from all extractions:"
print len(all_companies)

Number of all companies and valid companies from extraction 1:
56962 56898
Number of all companies and valid companies from extraction 2:
0 0
Number of all companies and valid companies from extraction 3:
0 0

Total of valid companies from all extractions:
56898


Now each individual set will be merged.

In [147]:
# load the old data
merged_companies = dict()
if os.path.isfile(merged):
    with open(merged, 'rb') as f:
        merged_companies = pickle.load(f)

# iterate and merge each extracted data
dicts = [extraction1, extraction2, extraction3]
for d in dicts:
    for c_name, comp in d.iteritems():
        if comp['is_company']:
            # list of links intersected with list containing all companies
            comp['links'] = all_companies.intersection(comp['all_links'])
            # on first iteration set merged_companies
            if c_name not in merged_companies:
                merged_companies[c_name] = comp
                continue
            # valid values set in extraction 1 are eventually overwritten by extraction 2 or 3
            for k, v in comp.iteritems():
                # if value not None
                if v:
                    comp[k] = v
            # set additional keys and values to merged companies
            merged_companies[c_name].update(comp)

print "Final amount of companies with link lists:", len(merged_companies)

Final amount of companies with link lists: 56898


In [150]:
# free up some memory
del extraction1, extraction2, extraction3, all_companies
gc.collect()

0

### 1.4. Adding Geo Location

Geo location is added in order to make some plots and illustrations according to the origin of the company. Mainly [OpenStreetMap Nominatim](https://wiki.openstreetmap.org/wiki/Nominatim) is used to resolve places strings to GPS coordinates. Alternatively the [Google Geocoding API (V3)](https://developers.google.com/maps/documentation/geocoding/start) can be used with an API key. An OpenSource alternative named [Photon](https://photon.komoot.de/) seems to be buggy in the current version of the Python library [geopy](https://pypi.python.org/pypi/geopy/1.11.0) that was used to crawl the data.

In [170]:
def get_location(comp, geolocators, fields_to_check=None):
    """Try to retrieve address and GPS using different geolocators."""
    
    # location variables
    loc_str = None
    loc_dict = {
        'location_geo': loc_str,
        'location_gps': (None, None)}
    if 'location_geo' in comp and 'location_gps' in comp:
        # if not None
        if comp['location_geo'] and comp['location_gps']:
            return {
                'location_geo': comp['location_geo'],
                'location_gps': comp['location_gps']}
    
    # check from first to last element for location, order important 
    # location_city usually headquarters
    if not fields_to_check:
        fields_to_check = ['location_city', 'location', 'location_country']
    for f in fields_to_check:
        # continue if field not in company
        if f not in comp:
            # delete from checkable fields
            continue
        # check if value unicode or list
        val = comp[f]
        if val:
            loc_str = val
            # join lists like [u'9th Floor, Equity Centre', u' Hospital Road, Upper Hill', u'Nairobi, Kenya']
            if isinstance(val, list):
                loc_str = " ".join(val)
        
        # fields are ordered, highest with value wins
        if loc_str:
            fields_to_check.remove(f)
            break
    # return if none of the fields set
    if not loc_str:
        return loc_dict

    # avoid request by looking into already processed
    if loc_str in all_locations:
        return all_locations[loc_str]

    # try to validate and get coordinates with geopy
    for g in geolocators:
        try:
            loc = g.geocode(loc_str, timeout=10, exactly_one=True)
            if 'nominatim' in g.domain:
                # max 1 request per second, http://wiki.openstreetmap.org/wiki/Nominatim_usage_policy
                time.sleep(1)
            if loc:
                loc_dict = {
                    'location_geo': loc.address,
                    'location_gps': (loc.latitude, loc.longitude)}
                all_locations[loc_str] = loc_dict
                return loc_dict            
        # IndexError seems when Photon returns no result
        except (GeocoderServiceError, IndexError):
            print "HTML, Python or API error on", comp['wiki_name'], "checking: ", loc_str

    # if more strings can be checked
    if fields_to_check:
        return get_location(comp, geolocators, fields_to_check)
    else:
        return loc_dict

Crawling geolocation will also take some time especially because there are [limitations](http://wiki.openstreetmap.org/wiki/Nominatim_usage_policy).

In [171]:
### initialize the different geolocator APIs
# https://photon.komoot.de/ - OpenSource
g_photon = geopy.geocoders.Photon()
# http://nominatim.openstreetmap.org/search.php?q=
g_osm = geopy.geocoders.Nominatim(
    user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0')
# should be only with API key otherwise GeocoderQuotaExceeded
g_google = geopy.geocoders.GoogleV3()
#geolocators = [g_google, g_photon, g_osm]
geolocators = [g_osm]

# save looked up locations to not make API request
all_locations = dict()
for c in tqdm_notebook(merged_companies, desc='Geolocation'):    
    processed_loc = get_location(merged_companies[c], geolocators)
    merged_companies[c].update(processed_loc)

HTML, Python or API error on Claro Americas checking:  Flagship headquarters  Brazil   Local headquarters   Argentina   Chile   Colombia   Costa Rica   Dominican Republic  Ecuador  El Salvador  Guatemala  Honduras    Nicaragua  Panama Paraguay  Peru   Uruguay
HTML, Python or API error on Phreesia checking:  432 Park Ave South New York, New York, 10016 U.S.
HTML, Python or API error on Ainol checking:  Lijincheng Industry Park, Shenzhen, Guangdong
HTML, Python or API error on BINDER (company) checking:  Tuttlingen
HTML, Python or API error on Holborn Assets LLC checking:  Dubaihttp://www.informazione.it/c/D3C67367-9506-403B-BC4C-EB3C3B5845F5/Holborn-Assets-International-Financial-Planner-to-Educate-South-Africans-on-Offshore-Trust-and-Tax-Planninghttp://www.indiainfoline.com/prnewswire?doc=201607280719PR_NEWS_EURO_ND__enUK201607281843_Public&dir=2http://goevnts.com/pr-landing?rkey=20160728enUK201607281843_Public&filter=4444http://globalnewsweek.com/prnewswire/index.html?doc=201607280719

### 1.5. Data Cleaning

Each field should follow a specific layout and each company should contain all fields even though the might be `None`. As an example the main dictonary `companies` contains a key `Google` which is a dictonary itself and contains fields like `key_people`. Furthermore `key_people` should be a list of dictonaries where each dictonary contains the name split in fields like `first_name`, `last_name` and so on.

In [173]:
# specify the type for each field
types = {
    # when first link is crawled
    'wiki_name': unicode,
    'wiki_url': unicode,
    'name_url_quoted': unicode,

    # when WIKI API is crawled
    'wiki_page_id': int,
    'wiki_api_url': unicode,
    'all_links': list,
    'links': set,
    'is_company': bool,
    'wiki_raw': unicode,

    # added only when Infobox company exists or fields from OpenCorporates
    # not all fields always exist, they are NaN in the resulting DataFrame
    'name': unicode,
    'type': unicode,
    'founded': datetime.datetime,
    'defunct': datetime.datetime,
    'location': unicode,
    'location_city': unicode,
    'location_country': unicode,
    'location_geo': unicode,
    'location_gps': tuple,
    # following not in OC
    'countries': set, # multiple possible from extraction 3
    'logo': dict,
    'key_people': list, # additonally processed with nameparser.HumanName (dict)
    'founders': list, # same as key_people
    'industry': list,
    'subsid': list,
    'products': list,
    'num_employees': int,
    'parent': unicode,
    'homepage': unicode
}

Especially often lists occur where there is only supposed to be one value. These and certain other cases that also involve reparsing are done.

In [174]:
for c, comp in tqdm_notebook(merged_companies.iteritems(), desc='Cleaning'):
    for k, val in comp.iteritems():
        if not val:
            continue
        # list values that are not supposed to be lists, to single value
        if types[k] != list and isinstance(val, list):
            if k == 'num_employees':
                comp[k] = parse_employee_nr(val[0], c)
            elif types[k] == datetime.datetime:
                 comp[k] = parse_date(val[0])
            else:
                comp[k] = val[0]
        # string values to unicode
        elif types[k] == unicode and isinstance(val, str):
            comp[k] = unicode(val)
        # single values with one element to list
        elif types[k] == list and not isinstance(val, list):
            # only one dict, e.g. in key_people, to list
            if k == 'key_people' or k == 'founders':
                # if not a dict, reparse
                if not isinstance(val, dict):
                    comp[k] = parse_people_name(val)
            else:
                comp[k] = [val]
        # single unicode values to datetime obj
        elif types[k] == datetime.datetime and isinstance(val, unicode):
                comp[k] = parse_date(val)
    # fill in all keys with None if they don't exist
    comp.update(
        {key: None for key in types.keys() if key not in comp}) 




Check if all fields where converted successfully to their designated types.

In [175]:
# reiterate to check conversion result
failed_types = dict()
for c, comp in merged_companies.iteritems():
    for k, val in comp.iteritems():
        # show values that do not fit specified type
        if val and not isinstance(val, types[k]):
            # just delete non extracted logos
            if k == 'logo':
                comp[k] = dict()
            if k not in failed_types:
                failed_types.update({
                    k: {
                        'is_type': type(val),
                        'should_be_type': types[k]}})

print "Failed types:", failed_types

Failed types: {'logo': {'is_type': <type 'unicode'>, 'should_be_type': <type 'dict'>}}


Often the field `key_people` contains the full name under the `last` name. That happens on the common string: _Ulrich Hackenberg, Head of Technical Development_

Another problem that occurs when a lot of names are just divided by a comma: _Alan Dukes (Chairman), Mike Aynsley (Group CEO)_ those cases are not handled.

See the following example:

In [176]:
pprint(merged_companies['Irish Bank Resolution Corporation']['key_people'])
pprint(merged_companies['Audi']['key_people'])

[{u'first': u'Mike',
  u'last': u'Alan Dukes',
  u'middle': u'Aynsley',
  u'nickname': u'Chairman Group CEO',
  u'suffix': u'',
  u'title': u''}]
[{u'first': u'Chairman of the Board of Management',
  u'last': u'Rupert Stadler',
  u'middle': u'',
  u'nickname': u'',
  u'suffix': u'',
  u'title': u''},
 {u'first': u'Head of Design',
  u'last': u'Marc Lichte',
  u'middle': u'',
  u'nickname': u'',
  u'suffix': u'',
  u'title': u''},
 {u'first': u'Head of Technical',
  u'last': u'Ulrich Hackenberg',
  u'middle': u'Development',
  u'nickname': u'',
  u'suffix': u'',
  u'title': u''}]


If the string is matched the element will be reparsed.

In [177]:
for c, comp in merged_companies.iteritems():
    if 'key_people' in comp and comp['key_people']:
        people = list()
        for person in comp['key_people']:
            # if a space seperates the name in last
            if 'last' in person and re.match(r'.+\ .+', person['last']):
                 people.append(parse_people_name(person['last']))
        if people:
            comp['key_people'] = people
    if 'founders' in comp and comp['founders']:
        founders = list()
        for person in comp['founders']:
            # if a space seperates the name in last
            if 'last' in person and re.match(r'.+\ .+', person['last']):
                 people.append(parse_people_name(person['last']))
        if founders:
            comp['founders'] = founders

In [178]:
pprint(merged_companies['Irish Bank Resolution Corporation']['key_people'])
pprint(merged_companies['Audi']['key_people'])

[[{u'first': u'Alan',
   u'last': u'Dukes',
   u'middle': u'',
   u'nickname': u'',
   u'suffix': u'',
   u'title': u''}]]
[[{u'first': u'Rupert',
   u'last': u'Stadler',
   u'middle': u'',
   u'nickname': u'',
   u'suffix': u'',
   u'title': u''}],
 [{u'first': u'Marc',
   u'last': u'Lichte',
   u'middle': u'',
   u'nickname': u'',
   u'suffix': u'',
   u'title': u''}],
 [{u'first': u'Ulrich',
   u'last': u'Hackenberg',
   u'middle': u'',
   u'nickname': u'',
   u'suffix': u'',
   u'title': u''}]]


Print statistics and certain keys to show the conversion result.

In [179]:
loc = 0
for c in merged_companies:
    if 'location_gps' in merged_companies[c] and merged_companies[c]['location_gps']:
        if merged_companies[c]['location_gps'][0] and merged_companies[c]['location_gps'][1]:
            loc += 1

print loc, "companies of total", len(merged_companies), "have a resolved GPS location.\n"

test_excerpt = 'Google'
print "Excerpt of", test_excerpt, "dict structure:"
include = types.keys()
include.remove('all_links')
include.remove('wiki_raw')
pprint(
    {k: merged_companies[test_excerpt][k] for k in include})

49646 companies of total 56898 have a resolved GPS location.

Excerpt of Google dict structure:
{'countries': None,
 'defunct': None,
 'founded': datetime.datetime(1998, 9, 4, 0, 0),
 'founders': [{u'first': u'Larry',
               u'last': u'Page',
               u'middle': u'',
               u'nickname': u'',
               u'suffix': u'',
               u'title': u''},
              {u'first': u'Sergey',
               u'last': u'Brin',
               u'middle': u'',
               u'nickname': u'',
               u'suffix': u'',
               u'title': u''}],
 'homepage': u'https://www.google.com',
 'industry': [u'Internet', u' Computer software', u' Computer hardware'],
 'is_company': True,
 'key_people': [{u'first': u'Sundar',
                 u'last': u'Pichai',
                 u'middle': u'',
                 u'nickname': u'CEO',
                 u'suffix': u'',
                 u'title': u''}],
 'links': set([u'ADATA',
               u'AOL',
               u'ARCA Space Cor

All merged data is also saved to a file for convenience reasons.

In [200]:
with open(merged, 'wb') as f:
    pickle.dump(merged_companies, f)

## 2. Preliminary Data Analysis
### 2.2. Company Analysis

In order to gain some insights into the companies a analysis of the current data is done along with some additional cleaning. As data structure a DataFrame from the [pandas](http://pandas.pydata.org/) library is used.

In [3]:
# convert from dict into dataframe
comp_df = pd.DataFrame.from_dict(merged_companies, orient='index')
comp_df.index.name = 'wiki_title'
comp_df

Unnamed: 0_level_0,links,location_geo,founded,name_url_quoted,logo,subsid,location_city,wiki_url,wiki_name,location,...,homepage,industry,key_people,location_country,products,all_links,location_gps,wiki_page_id,wb_api_url,wb_api_search_url
wiki_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
&pizza,"{Ruby Tuesday (restaurant), City Sports}",,2012-12-02 00:00:00,%26pizza,,,,https://en.wikipedia.org/w/index.php?title=%26...,&pizza,,...,http://www.andpizza.com/,,,,,"[City Sports, DC Central Kitchen, Fast casual ...","(None, None)",47858476,,
+Beryll,"{Henri Bendel, Fred Segal, Los Angeles Times}",,2006-12-02 00:00:00,%2BBeryll,"{u'wiki_raw_code': u'Beryll logo.jpg', u'wiki_...",,,https://en.wikipedia.org/w/index.php?title=%2B...,+Beryll,,...,beryll.com/,[Fashion accessories],"[{u'last': u'Designer', u'suffix': u'', u'titl...",,"[luxury goods, designer sunglasses]","[Angelina Jolie, Anna Pacquin, Austria, Bauhau...","(None, None)",13860681,,
...instore,"{Home Bargains, Heron Foods, Tesco, Poundstret...","Huddersfield, Yorkshire and the Humber, Englan...",2003-12-02 00:00:00,...instore,{u'wiki_raw_code': u'[[File:Instore-logo.png]]...,,,https://en.wikipedia.org/w/index.php?title=......,...instore,"Huddersfield, England, United Kingdom",...,http://www.poundstretcher.co.uk/,,,,,"[Aldi, Asda, BBC News Online, B & M, Bargain B...","(53.6467031, -1.7832076)",9291975,,
01 Communique,{},"Ont., Peel Region, Ontario, Canada",1992-12-02 00:00:00,01+Communique,"{u'wiki_raw_code': u'01 Communique Logo.svg', ...",,,https://en.wikipedia.org/w/index.php?title=01+...,01 Communique,"Mississauga, Ontario, Canada",...,http://www.01com.com,[Software],"[{u'last': u'', u'suffix': u'', u'title': u'Pr...",,"[Remote Access Software, Online Help Desk Supp...","[Arlington, Virginia, Citrix, I'm InTouch, Mis...","(43.5892854, -79.6441645)",15244876,,
01 Distribution,{RAI},"Lomé, Togo",2001-06-21 00:00:00,01+Distribution,,,Rome,https://en.wikipedia.org/w/index.php?title=01+...,01 Distribution,,...,,"[films, animation]","[[{u'last': u'Brocco', u'suffix': u'', u'title...",Italy,,"[Animation, Film distribution, Film industry, ...","(6.130419, 1.215829)",46724099,,
07th Expansion,{Alchemist (company)},日本,,07th+Expansion,,,,https://en.wikipedia.org/w/index.php?title=07t...,07th Expansion,Japan,...,,"[Sound novels, Video game industry, Interactiv...",,,"[Higurashi no Naku Koro ni, Umineko no Naku Ko...","[Alchemist (company), Comiket, Crunchyroll, Dō...","(36.5748441, 139.2394179)",5936289,,
0verflow,{},"神田, 神田ふれあい通り, 鍛冶町2, 鍛冶町, 東京, 千代田区, 東京都, 101-00...",1997-12-02 00:00:00,0verflow,"{u'wiki_raw_code': u'0verflowlogo.png', u'wiki...",,"Kanda, Chiyoda , Tokyo",https://en.wikipedia.org/w/index.php?title=0ve...,0verflow,,...,http://www.0verflow.com,"[Interactive entertainment, Brand novelties]","[{u'last': u'Ōnuma', u'suffix': u'', u'title':...",Japan,"[School Days, Summer Days, Cross Days]","[Anime, Anime News Network, CEO, Chiyoda, Toky...","(35.6917842, 139.770917)",10387049,,
1-2-3 (fuel station),{Statoil Fuel & Retail},,2000-12-05 00:00:00,1-2-3+%28fuel+station%29,,,,https://en.wikipedia.org/w/index.php?title=1-2...,1-2-3 (fuel station),,...,,,,,,"[Baltic states, Denmark, Fuel station, Kaunas,...","(None, None)",14018485,,
1-800 Contacts,"{Johnson & Johnson, DITTO, CooperVision, AEA I...","Draper, Utah, United States of America",1995-12-02 00:00:00,1-800+Contacts,,,,https://en.wikipedia.org/w/index.php?title=1-8...,1-800 Contacts,"Draper, Utah",...,http://www.1800contacts.com,[Contact lens retail],"[[{u'last': u'Coon', u'suffix': u'', u'title':...",,[Contact lenses],"[1-800 Contacts, Inc. v. WhenU.com, Inc., AEA ...","(40.5246711, -111.8638225)",4613366,,
1-800-FREE-411,"{Google, Liberty Media, Tellme Networks, March...","U, 4200, Mary Gates Memorial Drive Northeast, ...",2005-12-02 00:00:00,1-800-FREE-411,"{u'wiki_raw_code': u'800free411.gif', u'wiki_f...",,,https://en.wikipedia.org/w/index.php?title=1-8...,1-800-FREE-411,"Seattle, WA, U.S.",...,http://www.free411.com,[Telecommunications],"[{u'last': u'', u'suffix': u'', u'title': u'CE...",,[1-800-FREE411 directory service],"[4-1-1, 800-The-Info, Android (operating syste...","(47.66003045, -122.290454247)",18480351,,


Companies with missing name are eliminated. Those name misses should not occur because the Wikipedia link name is taken if it is missing in the "Infobox company" but this is as insurance.

In [4]:
comp_df.dropna(subset=['name'], inplace=True)

Companies with duplicate names can occur if people put similar company templates on pages, e.g.:
*
*
*

In [5]:
# find duplicate company names
c_dupl = pd.concat(g for _, g in comp_df.groupby('name') if len(g) > 1)
print "Found", len(c_dupl['name']), "duplicates:"
print c_dupl['name']

Found 426 duplicates:
wiki_title
Big Bazaar                                                      '''Big Bazaar'''
Big Bazaar (Rourkela)                                           '''Big Bazaar'''
Hanriot                                           ''Aéroplanes Hanriot et Cie.''
Hanriot (aircraft company)                        ''Aéroplanes Hanriot et Cie.''
Automobiles Darracq France                                      A Darracq et Cie
Darracq and Company London                                      A Darracq et Cie
ASICS                                                          ASICS Corporation
Haglöfs                                                        ASICS Corporation
Abrazo Scottsdale Campus                                Abrazo Scottsdale Campus
Abrazo Scottsdale Campus Arizona                        Abrazo Scottsdale Campus
Al Marjan Island                                                Al Marjan Island
Al Marjan Island LLC                                            Al Marjan Is

In [6]:
# add extra column adding the length of the link set
for i, row in c_dupl.iterrows():
  c_dupl.set_value(i, 'link_amount', len(row['links']))

In [7]:
s_list = [ 'List of Yamaha guitars',
          'Yamaha Corporation',
          'Yamaha electric guitar models']

for s in s_list:
    print s, c_dupl.loc[s, 'link_amount']

List of Yamaha guitars 1.0
Yamaha Corporation 331.0
Yamaha electric guitar models 5.0


In [8]:
# sort after multiple values, sorts the columns with least links and employees
sort_cols = ['name', 'link_amount', 'num_employees']
c_dupl_sort = c_dupl.sort_values(by=sort_cols, ascending=[True,True,True])
c_dupl_sort[['link_amount', 'name', 'num_employees']]

Unnamed: 0_level_0,link_amount,name,num_employees
wiki_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Big Bazaar (Rourkela),3.0,'''Big Bazaar''',
Big Bazaar,4.0,'''Big Bazaar''',36000.0
Hanriot,40.0,''Aéroplanes Hanriot et Cie.'',
Hanriot (aircraft company),40.0,''Aéroplanes Hanriot et Cie.'',
Darracq and Company London,7.0,A Darracq et Cie,
Automobiles Darracq France,9.0,A Darracq et Cie,
Haglöfs,1.0,ASICS Corporation,200.0
ASICS,19.0,ASICS Corporation,5937.0
Abrazo Scottsdale Campus,8.0,Abrazo Scottsdale Campus,650.0
Abrazo Scottsdale Campus Arizona,8.0,Abrazo Scottsdale Campus,650.0


In [9]:
# now drop the one's with duplicate names, will take the second duplicate with less links
c_dupl_sort.drop_duplicates('name', inplace=True)
c_dupl_sort

Unnamed: 0_level_0,links,location_geo,founded,name_url_quoted,logo,subsid,location_city,wiki_url,wiki_name,location,...,industry,key_people,location_country,products,all_links,location_gps,wiki_page_id,wb_api_url,wb_api_search_url,link_amount
wiki_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Big Bazaar (Rourkela),"{Future Group, Rourkela Steel Plant, Big Bazaar}",,2014-12-20 00:00:00,Big+Bazaar+%28Rourkela%29,"{u'wiki_raw_code': u'Big Bazaar Logo.jpg', u'w...",,,https://en.wikipedia.org/w/index.php?title=Big...,Big Bazaar (Rourkela),,...,[Retailing],"[{u'last': u'Nayak', u'suffix': u'', u'title':...",,[Hypermarket],"[Akshaya Patra Foundation, Asian Workers Devel...","(None, None)",46454193,,,3.0
Hanriot,"{Air Sylphe, Gourdou-Leseurre, Loire Aviation,...",France,1907-12-02 00:00:00,Hanriot,,,"Bétheny, Boulogne-Billancourt, Carrières-sur-S...",https://en.wikipedia.org/w/index.php?title=Han...,Hanriot,,...,"[Aeronautics, defence]",,France,[Aircraft],"[France, ABS Aerolight, ANF Les Mureaux, Abrah...","(46.603354, 1.8883335)",12681181,,,40.0
Darracq and Company London,"{Heenan & Froude, Automobiles Talbot France, C...","Suresnes, Arrondissement de Nanterre, Hauts-de...",2016-12-18 00:00:00,Darracq+and+Company+London,,,,https://en.wikipedia.org/w/index.php?title=Dar...,Darracq and Company London,"Suresnes, France",...,[Automotive],"[[{u'last': u'Darracq', u'suffix': u'', u'titl...",,[Automobiles],"[Acton, London, Adolphe Clément-Bayard, Alexan...","(48.8710147, 2.2252883)",30862570,,,7.0
Haglöfs,{ASICS},Sverige,1914-12-02 00:00:00,Hagl%C3%B6fs,"{u'wiki_raw_code': u'Image:Logo Haglofs.png', ...",,,https://en.wikipedia.org/w/index.php?title=Hag...,Haglöfs,Sweden,...,[outdoor equipment],,,"[hardware, clothing, footwear]","[ASICS, Outdoor equipment, Parent company, Swe...","(59.6749712, 14.5208584)",4446963,,,1.0
Abrazo Scottsdale Campus,"{Abrazo Community Health Network, Tenet Health...","Phoenix, Maricopa County, Arizona, United Stat...",1983-12-02 00:00:00,Abrazo+Scottsdale+Campus,,,,https://en.wikipedia.org/w/index.php?title=Abr...,Abrazo Scottsdale Campus,"Phoenix, Arizona",...,[Health Care],,,"[Health care Services, Emergency room services...","[Abrazo Community Health Network, Abrazo Healt...","(33.4485866, -112.0773455)",48338967,,,8.0
Al Marjan Island LLC,{},"Ras Al Khaimah, ‏رأس الخيمة‎, الإمارات العربيّ...",2013-12-02 00:00:00,Al+Marjan+Island+LLC,,,Ras Al Khaimah,https://en.wikipedia.org/w/index.php?title=Al+...,Al Marjan Island LLC,,...,[Real estate],,United Arab Emirates,,"[Ras Al Khaimah, The National (Abu Dhabi), Uni...","(25.7737705, 55.938232)",47113267,,,0.0
Alchemy Boulders,{},"Kandy, මහනුවර දිස්ත්‍රික්කය, Central Province,...",1996-12-02 00:00:00,Alchemy+Boulders,{u'wiki_raw_code': u'[[File:Alchemy Boulders l...,,,https://en.wikipedia.org/w/index.php?title=Alc...,Alchemy Boulders,"Kandy, Sri Lanka",...,[Mining & Mineral Processing],,,,"[Family owned, High end, Japan, Kandy, Mineral...","(7.2930922, 80.6350768)",13352678,,,0.0
"Amiga, Inc. (South Dakota)","{Commodore International, MetaComCo, Amiga, In...","600, North Derby Lane, North Sioux City, Union...",1997-12-02 00:00:00,Amiga%2C+Inc.+%28South+Dakota%29,,,,https://en.wikipedia.org/w/index.php?title=Ami...,"Amiga, Inc. (South Dakota)","600 N. Derby Lane, North Sioux City, South Dakota",...,,"[[{u'last': u'Schindler', u'suffix': u'', u'ti...",,"[A1200, Power A5000, AmigaOS 4, AmigaOS 5]","[A1200, ACube Systems Srl, AROS Research Opera...","(42.5252972785, -96.4960886743)",14893446,,,10.0
Amplify (company),"{Asus, News Corp}",,2000-12-02 00:00:00,Amplify+%28company%29,,,,https://en.wikipedia.org/w/index.php?title=Amp...,Amplify (company),"55 Washington Street\nSuite 900\nBrooklyn, NY ...",...,[Education],"[{u'last': u'Klein', u'suffix': u'', u'title':...",,"[Amplify Tablet, digital curriculum, assessmen...","[Amplify Tablet, Android (operating system), A...","(None, None)",38755162,,,2.0
Andersen Tax,"{HSBC, Arthur Andersen}","SF, California, United States of America",2002-12-02 00:00:00,Andersen+Tax,,,,https://en.wikipedia.org/w/index.php?title=And...,Andersen Tax,"San Francisco, California",...,[Professional Services],"[{u'last': u'', u'suffix': u'', u'title': u'CE...",,,"[Arthur Andersen, Family office, HSBC, Hedge f...","(37.7792808, -122.4192362)",29840306,,,2.0


In [12]:
# now drop the rows of our main dataframe which are in the dataframe with the unwanted duplicates
dupl_companies = list(c_dupl_sort.index.values)
comp_df.drop(dupl_companies, inplace=True)

A good export format is a CSV file the properties of lists and dictionaries are lost but that is actually an advantage exporting to `gexf` for visualization in gephi.

In [13]:
comp_df.to_csv(extraction_csv, encoding='utf-8')

### 2.2. Company Statistics

This section is about revealing some things that would not be visible by simply studying the table. To achieve this there is always a simple iteration over all values and the top 10 are generated using the `Counter` object from the [collections](https://docs.python.org/2/library/collections.html) library.

#### Most Links per Company

In below cell the number of links for each company is counted.

In [81]:
cnt = Counter()
# iterate every row that represents a company
for index, row in comp_df.iterrows():
    # empty link list
    if isinstance(row['links'], float):
        continue
    cnt[row['name']] = len(row['links'])

print "Top 10 companies with most links:"
pprint(cnt.most_common(10))

Top 10 companies with most links:
[(u'Telia Company AB', 529),
 (u'Hitachi, Ltd.', 524),
 (u'Vodafone Group', 519),
 (u'Sony Corporation', 511),
 (u'Panasonic Corporation', 489),
 (u'Toshiba Corporation', 488),
 (u'Kyocera Corporation', 487),
 (u'Mitsubishi Electric Corporation', 466),
 (u'Comcast Corporation', 461),
 (u'Fujitsu Ltd.', 460)]


Technology companies seem to have a lot of references to other companies.

#### Most Employees per Company

Next the companies are sorted according to number of employees.

In [83]:
cnt = Counter()
for index, row in comp_df.iterrows():
    if pd.isnull(row['num_employees']):
        continue
    cnt[row['name']] = row['num_employees']

print "Highest number of employees by company:"
pprint(cnt.most_common(10))

Highest number of employees by company:
[(u'UMW Holdings Berhad', 1100000.0),
 (u'JSC Russian Railways', 942808.0),
 (u'Rostec', 900000.0),
 (u'China Post Group Corporation', 860200.0),
 (u'Tata Group', 660800.0),
 (u'G4S plc', 618000.0),
 (u'Volkswagen AG', 610076.0),
 (u'Volkswagen', 610000.0),
 (u'People Ready', 600000.0),
 (u'Tesco PLC', 597784.0)]


The companies with most employees are unsurprisingly in area's where many people live.

#### Names of People over all Companies
The next cells are used to find most influential people's names. Regardless of the data cleaning efforts some entries are just to different from each other so some things need to be cleared. 

In [84]:
def check_if_real(p):
    """The human parser is not as accurate, so we need to sort out some false names"""
    if p and p.isalpha() and \
        not re.search(r'President|Chairman|Manag|Founder|VP|Officer|CTO|CEO|CFO|COO|Director', p, re.IGNORECASE):
            return True
    return False

In [85]:
females = set(names.words('female.txt'))

def count_people(key):
    cnt_first, cnt_last, cnt_female = Counter(), Counter(), Counter()
    for index, comp in merged_companies.iteritems():
        if not comp[key]:
            continue
        for person in comp[key]:
            if isinstance(person, dict):
                if check_if_real(person['first']):
                    cnt_first[person['first']] += 1
                    if person['first'] in females:
                        cnt_female[person['first']] += 1
                if check_if_real(person['last']):
                    cnt_last[person['last']] += 1

    print "Most common male first names by company key people:"
    pprint(cnt_first.most_common(10))

    print "Most common female first names by company key people:"
    pprint(cnt_female.most_common(10))

    print "Most common last names by company key people:"
    pprint(cnt_last.most_common(10))

In [86]:
count_people('key_people')

Most common male first names by company key people:
[(u'John', 690),
 (u'David', 571),
 (u'Michael', 420),
 (u'Robert', 314),
 (u'Mark', 307),
 (u'Peter', 295),
 (u'Paul', 267),
 (u'Richard', 264),
 (u'James', 254),
 (u'William', 206)]
Most common female first names by company key people:
[(u'Chris', 174),
 (u'George', 146),
 (u'Daniel', 116),
 (u'Tim', 102),
 (u'Frank', 99),
 (u'Bill', 97),
 (u'Tony', 82),
 (u'Andy', 80),
 (u'Alex', 73),
 (u'Lee', 66)]
Most common last names by company key people:
[(u'Smith', 118),
 (u'Lee', 72),
 (u'Jones', 62),
 (u'Brown', 59),
 (u'Taylor', 55),
 (u'Williams', 53),
 (u'Miller', 50),
 (u'Johnson', 47),
 (u'Wilson', 45),
 (u'Singh', 45)]


In [87]:
count_people('founders')

Most common male first names by company key people:
[(u'John', 491),
 (u'David', 324),
 (u'William', 311),
 (u'Robert', 246),
 (u'Michael', 245),
 (u'James', 197),
 (u'Peter', 197),
 (u'Charles', 192),
 (u'George', 189),
 (u'Paul', 186)]
Most common female first names by company key people:
[(u'George', 189),
 (u'Chris', 129),
 (u'Frank', 105),
 (u'Bill', 105),
 (u'Daniel', 100),
 (u'Alex', 68),
 (u'Tony', 54),
 (u'Fred', 49),
 (u'Tim', 48),
 (u'Sam', 45)]
Most common last names by company key people:
[(u'Smith', 82),
 (u'Johnson', 46),
 (u'Williams', 41),
 (u'Miller', 41),
 (u'Brown', 39),
 (u'Group', 39),
 (u'Anderson', 38),
 (u'Lee', 37),
 (u'Cohen', 37),
 (u'Gupta', 33)]


In [184]:
for index, comp in merged_companies.iteritems():
    if not comp['key_people']:
        continue
    for person in comp['key_people']:
        if isinstance(person, dict):
            if person['first'] == 'John' and  person['last'] == 'Smith':
                print "Actual people called John Smith in company:", comp['wiki_name']

Actual people called John Smith in company: GB Railfreight


As you can see John Smith is not only a very common name but also a very successful businessman. At least one could argue that if your first name is John or your last name is Smith not necessarily both. Also we are shocked by the female names which are probably all males.

### 2.3. Company Graphs

In this part there are some basics graphs revealing how the distribution of crawled companies is.

In the following cells the geodata of companies is gathered for specific areas. These points will be accumulated in a kernel density map (also heatmap) with the [geoplotlib](https://github.com/andrea-cuttone/geoplotlib) module.

In [193]:
def get_all_geodata(dataset, bounds=None):
    """Get geodata for specific latitude and longitude input bounds."""

    # filter bad rows
    dataset = dataset[dataset.location_gps.notnull()]
    
    # ignore the warning about chained assignments
    pd.options.mode.chained_assignment = None 
    # make 2 extra columns for split longitude and latitude
    dataset['lat'], dataset['lon'] = zip(*dataset.location_gps)    
    
    # only activity in the boundaries
    if not bounds:
        include = dataset.location_gps.notnull()
    else:
        include = (dataset.lat > bounds[0]) & \
        (dataset.lat < bounds[1]) & \
        (dataset.lon > bounds[2]) & \
        (dataset.lon < bounds[3])
        
    # get data in the format geoplotlib requires. We put the geodata in a dictionary structured as follows
    geo_coords = dataset.loc[include].location_gps.tolist()
    geo_data = {
        "lat": [x[0] for x in geo_coords if x[0]], 
        "lon": [x[1] for x in geo_coords if x[1]]
    }
    return geo_data

4 plots will be created one for all points, the USA, Europe and Asia (including China, Japan, South Korea, India). The plots are inline in the notebook.

In [194]:
# create the dictionary with lat and lon
geodat_all_comp = get_all_geodata(comp_df)
geodat_us_comp = get_all_geodata(
    comp_df,
    # [min lat, max lat, min lon, max lon)]
    [24.9493, 49.5904, -125.0011, -66.9326]
)
geodat_eu_comp = get_all_geodata(
    comp_df,
    # [min lat, max lat, min lon, max lon)]
    [27.6363, 70, -25, 40]
)
geodat_asia_comp = get_all_geodata(
    comp_df,
    # [min lat, max lat, min lon, max lon)]
    [0, 43, 70, 160]
)

In [195]:
def geo_plot(geodata):
    """Plot given coordinate input."""

    # bounding box on the minima and maxima of the data
    geoplotlib.set_bbox(
        BoundingBox(
            max(geodata['lat']), 
            max(geodata['lon']), 
            min(geodata['lat']), 
            min(geodata['lon'])
        ));
    
    # kernel density estimation visualization
    geoplotlib.kde(geodata, bw=5, cut_below=1e-3, cmap='hot', alpha=170)
    # google tiles with lyrs=y ... hybrid
    geoplotlib.tiles_provider({
        #'url': lambda zoom, xtile, ytile: 'https://mt1.google.com/vt/lyrs=y&hl=en&x=%d&y=%d&z=%d' % (xtile, ytile, zoom),
        'url': lambda zoom, xtile, ytile: 'https://maps.wikimedia.org/osm-intl/%d/%d/%d.png' % (zoom, xtile, ytile),
        'tiles_dir': 'DTU-social_graphs',
        'attribution': 'DTU - 02805 Social graphs and interactions'
    })
    
    geoplotlib.inline();

In [196]:
print 'all companies'
geo_plot(geodat_all_comp)

all companies
('smallest non-zero count', 1.4329573088768091e-09)
('max count:', 21.740096339281834)


In [197]:
print 'US companies'
geo_plot(geodat_us_comp)

US companies
('smallest non-zero count', 1.4329573088768091e-09)
('max count:', 10.57326025508225)


Digging deeper into a more specific country outline the hotspots of big cities with many companies become visible. For the USA there is New York, San Francisco and Los Angeles. The hotspot at the centroid of the map are those companies who just have the field `location_country` set to USA. 

In [198]:
print 'EU companies'
geo_plot(geodat_eu_comp)

EU companies
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 14.035635838058237)


In Europe the companies are much more distributed and not only in specific regions although London and Paris stick out as hot spots.

In [199]:
print 'Asia companies'
geo_plot(geodat_asia_comp)

Asia companies
('smallest non-zero count', 1.9360402507283206e-08)
('max count:', 3.3940539208278131)


In Asia there is not as many company distributions also because there is probably not as much data.

#### General Histogram Distribution Analysis in accordance with Found and Defunct year 

In [101]:
com_founded_years=list()
com_defounded_years=list()
acient=0
for com, com_detail in comp_df.iteritems():
    if com_detail['founded']: 
        found_year=int(com_detail['founded'].year)
        if found_year>=1800 and found_year<=int(datetime.date.today().year):
            com_founded_years.append(int(found_year))
        elif found_year<1800: 
            acient=acient+1 #count of com eariler than 1800
    if com_detail['defunct']:
        defound_year=int(com_detail['defunct'].year)
        if defound_year>=1800 and defound_year<=int(datetime.date.today().year):
            com_defounded_years.append(int(defound_year))     

            
#compute defunct rate
com_founded_dict= dict()
com_founded_cum=dict()
com_founded_count=Counter(com_founded_years)

for key in sorted(com_founded_count.iterkeys()):
    com_founded_dict[key]= com_founded_count[key]
    
    acient=acient+com_founded_count[key]
    com_founded_cum[key]= acient

defunt_cum=0
com_defounded_dict= dict()

com_defounded_count=Counter(com_defounded_years)
for key in sorted(com_defounded_count.iterkeys()):
    com_defounded_dict[key]= com_defounded_count[key]
    
    defunt_cum=defunt_cum+com_defounded_count[key]
    com_founded_cum[key]=com_founded_cum[key]-defunt_cum
#patching defunct dict
years_no_defunct=set(com_defounded_dict.keys()).symmetric_difference(com_founded_dict.keys())
for year_no_defunct in years_no_defunct:
    com_defounded_dict[year_no_defunct]=0
    
defunct_rates=list()
for cum, defunct in zip(com_founded_cum.values(),com_defounded_dict.values()):
    defunct_rates.append((defunct/cum)*100)


#plot histogram
fig = plt.figure(num=None, figsize=(20, 20), dpi=1200)
ax = fig.add_subplot(211)
count, bins, ignored = ax.hist(com_founded_years, bins=50,alpha=0.5, range=(min(com_founded_years), max(com_founded_years)),label="Found")
count, bins, ignored = ax.hist(com_defounded_years, bins=50,alpha=0.5, range=(min(com_defounded_years), max(com_defounded_years)),label="Defunct")

ax.legend(loc=2,fontsize=15)
ax.set_title("Number of companies founded/defunct after 1800",fontsize=20)
ax.set_xlabel('Timeline',fontsize=15)
ax.set_ylabel('Count',fontsize=15)
plt.xticks(list(plt.xticks()[0]) + [2008]) #add special tick for 2008, possible financial crisis influence

ax = fig.add_subplot(212)
ax.bar(com_founded_dict.keys(),defunct_rates,color='y')
ax.set_title("Defunct rate since 1800",fontsize=20)
ax.set_xlabel('Timeline',fontsize=15)
ax.set_ylabel('Rate',fontsize=15)


plt.xticks(list(plt.xticks()[0]) + [2008]) #add special tick for 2008, possible financial crisis influence
plt.show()

wiki_title
&pizza                                        {Ruby Tuesday (restaurant), City Sports}
+Beryll                                  {Henri Bendel, Fred Segal, Los Angeles Times}
...instore                           {Home Bargains, Heron Foods, Tesco, Shoprite (...
01 Communique                                                                       {}
01 Distribution                                                                  {RAI}
07th Expansion                                                   {Alchemist (company)}
0verflow                                                                            {}
1-2-3 (fuel station)                                           {Statoil Fuel & Retail}
1-800 Contacts                       {Johnson & Johnson, DITTO, CooperVision, AEA I...
1-800-FREE-411                       {Google, Liberty Media, Tellme Networks, March...
1-800-Flowers                        {CompuServe, AOL, Comtex, Martha Stewart Livin...
1-800-GOT-JUNK?                 

KeyError: 'founded'

Companies by age of foundation.

## 3. Network Construction

### 3.1. Alternative Construction

## 4. Network Analysis

Below analysis 

### 4.1. Degree Distribution

### 4.2. Power-laws and Friendship Paradox

Next step is to prove the [Friendship paradox](https://en.wikipedia.org/wiki/Friendship_paradox). This paradox states that _almost everyone_ have fewer friends than their friends have, on average. 
A consequence of having a network with a power-law degree distribution. The explanation is that almost everyone is friends with a hub, that drives up the average degree of the friends.

Pick a node $i$ at random (e.g. use `random.choice`). [Find its degree](http://networkx.lanl.gov/reference/generated/networkx.Graph.degree.html) and [neighbors](http://networkx.lanl.gov/reference/generated/networkx.Graph.neighbors.html).

In [None]:
# pick random node
i = rand.choice(G.nodes())
# find its degree
i_deg = G.degree(i)
# find neighbours
i_neighs = G.neighbors(i)

In [None]:
def get_average_degree(graph, nodes=None, print_edges=False):
    if not nodes:
        nodes = graph.nodes()
    deg_graph = graph.degree(nbunch=nodes)
    sum_of_edges = sum(deg_graph.values())
    deg = sum_of_edges/len(nodes)
    if print_edges:
        return deg, deg_graph.values()
    return deg

Now we find the neighbors average degree and compare the averages to check if it's true that $i$'s friends (on average) have more friends than $i$.

In [None]:
# calculate average degree of neighbors
i_neighs_avg_deg = get_average_degree(G, i_neighs, print_edges=False)

# show details
print 'Random node i {0} with degree of {1} has neighbour(s) {2}'.format(i, i_deg, i_neighs)
print 'Neighbour(s) {0} of random node i {1} has/have average degree of {2}'.format(i_neighs, i, i_neighs_avg_deg)

This will be done a 1000 times to see the percantage of cases how often the paradoxon is true.

In [None]:
# try 1000 times how often paradoxon true
paradoxon_true = 0
trials = 1000
for x in xrange(0, trials):
    i = rand.choice(G.nodes())
    i_deg = G.degree(i)
    i_neighs = G.neighbors(i)
    i_neighs_avg_deg = get_average_degree(G, i_neighs, print_edges=False)
    if i_neighs_avg_deg > i_deg:
        paradoxon_true += 1
print "Out of {0} trials the friendship paradoxon was true {1} time(s).".format(trials, paradoxon_true)

### 4.3. Centrality

### 4.4. Assortativity

### 4.5. Modularity

### 4.6. Communities

### 4.7. Network Visualizations and Statistics


##  5. Sentiment Analysis of Twitter Data


###  5.1. API Data Gathering

###  5.2. Happiness Averages of Companies


###  5.3. Wordclouds

## 6. Discussion. 

Think critically about your creation:
* What went well?,
* What is still missing? What could be improved?, Why?