** 02805 Social graphs and interactions **

# Data Extraction Part 3

## Explanation

This was the method originally planned for extracting company data, before the "Infobox company" template was discovered. Unfortunately, the list pages were too different from one another, so links that were not companies kept appearing. The articles could have been verified with the method in Part 2, but not within the API limitations.

Nevertheless, the documented steps are shown below.

In [1]:
# IPython global cell magic
%reset
%matplotlib inline

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [2]:
# import all necessary packages
import bs4 # HTML parser
from collections import Counter, OrderedDict # counting elements and ordering keys in dictionaries
import community # python-louvain package
import datetime # handle date objects
import dateparser # parse any (also foreign) date format to object: https://pypi.python.org/pypi/dateparser
from __future__ import division # true division: "/" always returns a float
import gc # garbage collector
import geoplotlib # plot points on tiled maps
from geoplotlib.utils import BoundingBox
import geopy # get geo location according to addresses
from geopy.exc import GeocoderServiceError
from infomap import infomap # python infomap algorithm, needs to be in same directory
import itertools # iterators for efficient looping
import json # JSON parser
import math # math operations
from matplotlib import pyplot as plt # plotting figures
import mwparserfromhell # parse MediaWiki syntax: https://github.com/earwig/mwparserfromhell
from nameparser import HumanName # parse a human name
import networkx as nx # networks creation library
import nltk # natural language processing
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import numpy as np # powerful calculation library
import operator # efficient operator functions
import os # operating system operations, e.g.: with files and folders
import pandas as pd # use easy-to-use data frames for data analysis
import pickle # python data structures as files
from pprint import pprint # print data structures prettier
import random as rand # pick thing at random
import re # regex
import requests # request URL content
import sys # system operations
import time # sleep timer
from tqdm import tqdm_notebook # make a nice progressbar
import urllib # handle special URL chars

In [3]:
# make working directory
directory = os.getcwd() + '/companies'
if not os.path.exists(directory):
    os.makedirs(directory)

# files from data crawling
ex1_fdat = directory + '/extraction1_data.pkl'
ex2_fdat = directory + '/extraction2_data.pkl'
ex3_tmp_fdat = directory + '/tmp_extraction3_data.pkl'
ex3_fdat = directory + '/extraction3_data.pkl'
merged = directory + '/merged_data.pkl'
# network files
network_f = directory + '/network.pkl'
network_red_f = directory + '/reduced_network.pkl'
gephi_f = directory + '/gehpi.gexf'

# specify nltk data dir, otherwise LookupError
nltk.data.path.append(os.getcwd() + '/../nltk_data')
from nltk.corpus import names

## 1. Investigate Country Lists for Companies

The page listing companies by country is divided into two sections. The main interest is in the pages of the category "Lists of companies by country" under the second heading, since these links lead directly to a list of companies. The links under the first heading, however, lead to subcategories. Some countries therefore have no direct list, and their links must be parsed individually.

In [4]:
# parse HTML
wiki_cat_url = 'https://en.wikipedia.org/wiki/Category:Lists_of_companies_by_country'
r = requests.get(wiki_cat_url)
wiki_soup = bs4.BeautifulSoup(r.text, 'lxml')

# first heading
wiki_subcats = wiki_soup.find("div", attrs={"id": "mw-subcategories"})

# second heading
wiki_countries = wiki_soup.find("div", attrs={"id": "mw-pages"})

In [5]:
def get_c_names(wiki_section):
    # each letter has new unordered list
    links2d = [l.find_all('a') for l in wiki_section.find_all('ul')]
    # combine all links from each list
    links = list(itertools.chain.from_iterable(links2d))

    c_list = list()
    for l in links:
        if 'Lists of companies of' in l.text:
            country = l.text.split('Lists of companies of ')[-1]
        elif 'List of' in l.text:
            country = l.text.split('List of companies of ')[-1]
        else:
            # ignore lists like airlines by country
            continue
        if country.startswith('the '):
            country = country[len('the '):]
        c_list.append(country)
    return c_list
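
The title handling in `get_c_names` can be tried out without fetching the category page. The following is a minimal sketch, assuming the link texts look like the sample titles below. The helper name `country_from_title` and the samples are illustrative, and the regular expression is a stricter variant of the `split` calls above, not the notebook's own code.

```python
import re

# Matches "List of companies of X" and "Lists of companies of X",
# with an optional "the " in front of the country name.
TITLE_RE = re.compile(r'^Lists? of companies of (?:the )?(?P<country>.+)$')

def country_from_title(title):
    """Return the country part of a list title, or None for other kinds of lists."""
    match = TITLE_RE.match(title)
    return match.group('country') if match else None

# Illustrative link texts (assumed, not taken from the live page)
samples = [
    'List of companies of Denmark',
    'Lists of companies of the United States',
    'List of airlines of Norway',
]
countries = [country_from_title(t) for t in samples]
```

Unlike the `'List of' in l.text` test, this variant rejects any title that does not contain "companies of", so other kinds of lists never end up in the country list.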

In [6]:
country_list_subcat = get_c_names(wiki_subcats)
country_list_page = get_c_names(wiki_countries)

print 'Country list extracted from "Subcategory" section:'
print country_list_subcat

print '\nCountry list extracted from "Lists of companies by country":'
print country_list_page

Country list extracted from "Subcategory" section:
[u'Afghanistan', u'Albania', u'Algeria', u'Andorra', u'Angola', u'Argentina', u'Armenia', u'Australia', u'Austria', u'Azerbaijan', u'Bahamas', u'Bahrain', u'Bangladesh', u'Barbados', u'Belarus', u'Belgium', u'Belize', u'Benin', u'Bermuda', u'Bhutan', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'Brunei', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Central African Republic', u'Chad', u'Chile', u'China', u'Colombia', u'Comoros', u'Democratic Republic of the Congo', u'Republic of the Congo', u'Costa Rica', u'Croatia', u'Cuba', u'Cyprus', u'Czech Republic', u'Denmark', u'Djibouti', u'Dominica', u'Dominican Republic', u'Ecuador', u'Egypt', u'El Salvador', u'Equatorial Guinea', u'Estonia', u'Ethiopia', u'Faroe Islands', u'Fiji', u'Finland', u'France', u'Gabon', u'Gambia', u'Georgia (country)', u'Germany', u'Ghana', u'Greece', u'Guatemala', u'Guinea', u'Guyana', u'Haiti', 

In [7]:
print "Countries that do not have a direct wiki page of all registered companies:"
# only elements that are in first list but not second
print set(country_list_subcat) - set(country_list_page)

Countries that do not have a direct wiki page of all registered companies:
set([u'Bolivia', u'Tonga', u'Maldives', u'Samoa', u'Uruguay', u'Solomon Islands', u'Gambia', u'Madadascar', u'Kyrgyzstan', u'Moldova', u'United Kingdom', u'Suriname', u'Paraguay', u'Fiji', u'Yemen', u'Tajikistan', u'Ecuador'])


In [8]:
# now construct link list for API query
links2d = [l.find_all('a') for l in wiki_countries.find_all('ul')]
links = list(itertools.chain.from_iterable(links2d))

# key: country, value: wikipedia link
country_extend_urls = dict()
for l in links:
    country = l.text.split('List of companies of ')[-1]
    country_extend_urls[country] = l.get('href').split('/wiki/')[-1]

# manually insert links of important wiki pages
country_extend_urls['Moldova'] = 'List_of_banks_in_Moldova'
country_extend_urls['United Kingdom'] = [
    'List_of_banks_in_Guernsey',
    'List_of_banks_in_Jersey',
    'List_of_companies_of_the_Isle_of_Man',
    'List_of_companies_based_in_Bradford',
    'List_of_companies_in_Sheffield',
    'List_of_companies_in_Harrogate',
    'List_of_companies_based_in_Greater_Manchester',
    'List_of_companies_based_in_Newcastle_upon_Tyne',
    'List_of_companies_in_Lincolnshire',
    'List_of_companies_in_the_City_of_Sunderland',
    'List_of_largest_private_companies_in_the_United_Kingdom',
    'List_of_companies_based_in_London']
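
After the manual additions, the values of `country_extend_urls` are either a single page title or a list of titles. A small sketch of how both cases could be treated uniformly, assuming a hypothetical helper `as_page_list` and a shortened sample dictionary:

```python
def as_page_list(value):
    """Wrap a single page title in a list; leave lists unchanged."""
    return value if isinstance(value, list) else [value]

# Shortened sample with one single-page and one multi-page country
sample_urls = {
    'Moldova': 'List_of_banks_in_Moldova',
    'United Kingdom': ['List_of_banks_in_Guernsey', 'List_of_banks_in_Jersey'],
}

# One (country, page) pair per page to request
pairs = [(country, page)
         for country in sorted(sample_urls)
         for page in as_page_list(sample_urls[country])]
```

With such a helper, a loop over all countries would not need a separate branch for list values.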

In [9]:
print "Countries with wiki pages for companies total:", len(country_extend_urls)

Countries with wiki pages for companies total: 176


## 2. Retrieve Links from Country Lists

Query each page for the links that represent the companies and thus the nodes. There are some special cases, such as Japan, where the links are in a table and each table cell also references the Tokyo Stock Exchange. However, these nodes will become isolated and not be part of the major company network.

In [10]:
# wikipedia API parameters to parse the sections and all links in the wanted sections
# https://en.wikipedia.org/w/api.php?action=help&modules=parse
baseurl = u'https://en.wikipedia.org/w/api.php'
action = u'action=parse'
dataformat = u'format=json'

# exclude these sections
exclude_sec = set(['External links', 'See also', 'References'])
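
The notebook assembles its query strings by hand from these fragments. As an alternative sketch, the same `action=parse` URLs could be built with the standard library's `urlencode`, which also percent-encodes characters such as `|`. The function name `build_parse_url` is hypothetical, and the sketch assumes the page title is passed in raw form (not already percent-encoded).

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE_URL = 'https://en.wikipedia.org/w/api.php'

def build_parse_url(page, prop, section=None):
    """Build an action=parse query URL; 'section' is optional."""
    params = [('action', 'parse'), ('page', page),
              ('format', 'json'), ('prop', prop)]
    if section is not None:
        params.append(('section', section))
    # urlencode escapes special characters, e.g. "|" becomes "%7C"
    return BASE_URL + '?' + urlencode(params)

sections_url = build_parse_url('List_of_companies_of_Australia', 'sections')
links_url = build_parse_url('List_of_companies_of_Australia',
                            'links|wikitext', section=4)
```

A list of tuples is used instead of a dictionary so that the parameter order is stable across Python versions.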

In [11]:
def get_json_from_url(url):
    r = requests.get(url)
    
    # on HTTP error codes
    if r.status_code != 200:
        return None

    # try converting into JSON
    try:
        sec = r.json()
        return sec
    except ValueError:  # includes simplejson.decoder.JSONDecodeError
        print 'WARN: Decoding JSON has failed on:', url
    return None

In [17]:
def get_link_list(link_url, industry, country, comp):
    
    link_content = get_json_from_url(link_url)
    # nothing to extract if the request failed or the page could not be parsed
    if not link_content or 'parse' not in link_content:
        return comp

    # list of links per section, main namespace only and no further lists
    company_names = [
        x['*'] for x in link_content['parse']['links']
        if x['ns'] == 0 and 'List of' not in x['*']]

    # check in original wikitext to exclude additional links that aren't companies
    w_text = link_content['parse']['wikitext']['*']
    
    for c in company_names:
        # if company inside table, only column beginning
        table = re.compile(r'\n\|\s*\[\[' + re.escape(c))
        # if company inside list, only bullet items
        bullet = re.compile(r'\n\*\s*\[\[' + re.escape(c))

        if table.search(w_text) or bullet.search(w_text):
            # keys match the later cells; page_id is the id of the list page
            comp[c] = {
                'is_company': True,
                'industry': set([industry]),
                'country': set([country]),
                'page_id': link_content['parse']['pageid']}
    return comp
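
The two regular expressions in `get_link_list` can be checked offline. This sketch applies the same patterns to a made-up wikitext fragment; the company names, the fragment, and the helper name `listed_as_entry` are illustrative assumptions.

```python
import re

def listed_as_entry(name, wikitext):
    """True if [[name... starts a bullet item or the first cell of a table row."""
    table = re.compile(r'\n\|\s*\[\[' + re.escape(name))
    bullet = re.compile(r'\n\*\s*\[\[' + re.escape(name))
    return bool(table.search(wikitext) or bullet.search(wikitext))

# Made-up wikitext with a prose link, a bullet item and a table row
sample_wikitext = (
    "Notable firms, many listed on the [[Example Stock Exchange]].\n"
    "* [[Acme Corp]], a manufacturer\n"
    "{| class=\"wikitable\"\n"
    "|-\n"
    "| [[Globex (company)|Globex]] || [[Springfield]]\n"
    "|}\n"
)

names = ['Example Stock Exchange', 'Acme Corp', 'Globex (company)', 'Springfield']
results = dict((n, listed_as_entry(n, sample_wikitext)) for n in names)
```

Only the bullet item and the first table cell are accepted; the prose link and the second table cell are rejected. Note that the patterns match a prefix of the link target, so a shorter name such as `Acme` would also match the `[[Acme Corp]]` entry.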

In [18]:
# method to build link list with 2 requests to API
### examples:
# https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Australia&prop=sections
# https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Australia&section=4&prop=links|wikitext

def get_c_sec_and_links(title, country):
    comp = dict()
    # title already url encoded
    page = u'page={0}'.format(title)
    # concatenate wikipedia query
    sec_url = u'{0}?{1}&{2}&{3}&prop=sections'.format(baseurl, action, page, dataformat)
    # returns None on HTTP errors or undecodable JSON
    sec = get_json_from_url(sec_url)

    # for each section get a list of links
    if not sec or 'parse' not in sec:
        print 'No parsable sections in:', sec_url
        return comp
    for s in sec['parse']['sections']:
        # skip excluded headings and everything below the first TOC level
        if s['line'] in exclude_sec or s['toclevel'] != 1:
            continue

        # the heading is used as industry unless the page is sorted A-Z: it must be
        # longer than one character, purely alphabetic and must not contain "firms"
        if len(s['line']) > 1 and s['line'].isalpha() and 'firms' not in s['line']:
            industry = s['line']
        else:
            industry = 'unknown'

        # new request for links and wikitext
        link_url = u'{0}?{1}&{2}&section={3}&{4}&prop=links|wikitext'.format(
            baseurl, action, page, s['index'], dataformat)
        comp.update(get_link_list(link_url, industry, country, comp))

    # show empty country lists
    if not comp:
        # some countries have no sections, so links and wikitext of the whole page are requested
        # the wikitext is useful to make sure links are in a bullet list or table
        link_url = u'{0}?{1}&{2}&{3}&prop=links|wikitext'.format(
            baseurl, action, page, dataformat)
        print country, "without any sections:", sec_url, "Requesting:", link_url
        comp.update(get_link_list(link_url, 'unknown', country, comp))
    return comp
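
The section filter and the industry heuristic from `get_c_sec_and_links` can be isolated and tested on sample data. A sketch under the assumption that the `sections` entries carry the `line` and `toclevel` fields used above; the sample headings and the helper name `industry_for_section` are invented for illustration.

```python
EXCLUDED = set(['External links', 'See also', 'References'])

def industry_for_section(section):
    """Return None for skipped sections, otherwise the industry label."""
    line = section['line']
    if line in EXCLUDED or section['toclevel'] != 1:
        return None
    # a usable industry heading is longer than one character,
    # purely alphabetic and does not contain "firms"
    if len(line) > 1 and line.isalpha() and 'firms' not in line:
        return line
    return 'unknown'

# Invented section entries in the shape returned by prop=sections
sample_sections = [
    {'line': 'A', 'toclevel': 1, 'index': '1'},
    {'line': 'Financials', 'toclevel': 1, 'index': '2'},
    {'line': 'Banks', 'toclevel': 2, 'index': '3'},
    {'line': 'Notable firms', 'toclevel': 1, 'index': '4'},
    {'line': 'See also', 'toclevel': 1, 'index': '5'},
]
labels = [industry_for_section(s) for s in sample_sections]
```

One consequence of the `isalpha()` test is that any heading containing a space, such as a multi-word industry name, is labelled `'unknown'`.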

In [19]:
# if there are any companies/industries found in multiple countries combine values
def combine_dict_vals(companies, new_companies):
    if new_companies:
        equal_companies = set(new_companies).intersection(companies)
        for c in equal_companies:
            # take old value and combine with new
            for k, v in companies[c].iteritems():
                if 'unknown' not in str(v):
                    if 'unknown' in str(new_companies[c][k]):
                        new_companies[c][k] = v
                    else:
                        if isinstance(new_companies[c][k], set):
                            new_companies[c][k].update(v)
        companies.update(new_companies)
    return companies
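
To illustrate the intent of `combine_dict_vals`, the sketch below merges two records of the same company found on two country lists. It is a simplified variant written for this example, not the function above: it works on a single pair of records, returns a new dictionary, and the page ids are placeholders.

```python
def merge_company(old, new):
    """Merge two records: unite set values and drop 'unknown' if anything else is known."""
    merged = dict(new)
    for key, old_val in old.items():
        new_val = new.get(key)
        if isinstance(old_val, set) and isinstance(new_val, set):
            combined = old_val | new_val
            if len(combined) > 1:
                combined.discard('unknown')
            merged[key] = combined
    return merged

# Placeholder records for one company listed under two countries
old = {'country': set(['Sweden']), 'industry': set(['unknown']), 'page_id': 1}
new = {'country': set(['Ireland']), 'industry': set(['Telecommunications']), 'page_id': 2}
merged = merge_company(old, new)
```

Set-valued fields are united, an `'unknown'` industry gives way to a known one, and non-set fields such as the page id keep the value of the newer record.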

In [95]:
%%time
### this parses all countries companies
# get links and combine to one big dictionary with one key per company
companies = dict()
for l in country_extend_urls:
    if isinstance(country_extend_urls[l], list):
        for ele in country_extend_urls[l]:
            companies.update(combine_dict_vals(
                    companies, get_c_sec_and_links(ele, l)))
    else:
        companies.update(combine_dict_vals(
                companies, get_c_sec_and_links(country_extend_urls[l], l)))

Georgia (country) without any sections: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Georgia_(country)&format=json&prop=sections Requesting: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Georgia_(country)&format=json&prop=links|wikitext
Iran without any sections: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Iran&format=json&prop=sections Requesting: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Iran&format=json&prop=links|wikitext
Iceland without any sections: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Iceland&format=json&prop=sections Requesting: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_companies_of_Iceland&format=json&prop=links|wikitext
United Kingdom without any sections: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_banks_in_Guernsey&format=json&prop=sections Requesting: https://en.wikipedia.org/w/api.ph

In [96]:
print 'Companies extracted from Wikipedia:', len(companies.keys())

Companies extracted from Wikipedia: 13793


In [97]:
# companies listed in more than one country
for k, v in companies.iteritems():
    if len(v['country']) > 1 or len(v['industry']) > 1:
        print k, v

PMC-Sierra {'country': set([u'Canada', u'the United States']), 'industry': set([u'Companies']), 'page_id': 76559}
Ericsson {'country': set([u'Sweden', u'Ireland']), 'industry': set([u'Telecommunications']), 'page_id': 654576}
Old Mutual {'country': set(['United Kingdom', u'South Africa']), 'industry': set([u'Financials']), 'page_id': 39857195}
Isle of Man Newspapers {'country': set(['United Kingdom', u'the Isle of Man']), 'industry': set([u'Media']), 'page_id': 31215006}
Royal Bank of Canada {'country': set(['United Kingdom', u'Canada']), 'industry': set([u'Companies']), 'page_id': 16054929}
EuroManx {'country': set(['United Kingdom', u'the Isle of Man']), 'industry': set(['unknown']), 'page_id': 31215006}
Fila (company) {'country': set([u'South Korea', u'Italy']), 'industry': set(['unknown']), 'page_id': 90443}
Volvo {'country': set([u'Belgium', u'Sweden']), 'industry': set([u'Industrials']), 'page_id': 285254}
Shanghai Tang {'country': set([u'China', u'Hong Kong']), 'industry': set([

In [98]:
# store temp company data in one binary file, to avoid reextraction
with open('{0}/tmp_extraction3_data.pkl'.format(directory), 'wb') as f:
    pickle.dump(companies, f)

## 3. Retrieve Links from Company Page

In [99]:
# load data dictionary
with open('{0}/tmp_extraction3_data.pkl'.format(directory), 'rb') as f:
    tmp_com_dat = pickle.load(f)

In [101]:
def get_c_interlinks(url, title, comp):

    json_content = get_json_from_url(url)
    # request failed or the response contains no query result
    if not json_content or 'query' not in json_content:
        print 'No query result at:', url
        return comp

    # a page id of '-1' marks a title that does not exist
    for page in json_content['query']['pages']:
        if page != '-1':
            # extract the links that match with a company name
            if 'links' not in json_content['query']['pages'][page]:
                print 'No links found at:', url
                continue
            link_list = [
                x['title'] for x in json_content['query']['pages'][page]['links']
                if x['title'] in tmp_com_dat]
            if 'links' in comp:
                comp['links'].update(set(link_list))
            else:
                comp['links'] = set(link_list)
    # maximum 500 links per page, additional request might be necessary
    # e.g.: plcontinue=226160|0|Lieutenant_General
    if 'continue' in json_content:
        content_extended = u'{0}&plcontinue={1}'.format(content, json_content['continue']['plcontinue'])
        next_url = u'{0}?{1}&{2}&{3}&{4}'.format(baseurl, action, title, dataformat, content_extended)
        comp = get_c_interlinks(next_url, title, comp)
    return comp
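
The continuation handling can also be written as a loop instead of a recursion. The sketch below shows the idea with a stand-in for the HTTP request: `fake_fetch` and the two canned responses are assumptions made for this example, shaped like the `prop=links` responses used above.

```python
def collect_links(fetch, known_titles):
    """Follow 'plcontinue' tokens until the response has no further batch."""
    links = set()
    token = None
    while True:
        response = fetch(token)
        for page_id, page in response['query']['pages'].items():
            if page_id == '-1':
                continue  # title does not exist
            for link in page.get('links', []):
                if link['title'] in known_titles:
                    links.add(link['title'])
        if 'continue' not in response:
            return links
        token = response['continue']['plcontinue']

# Canned responses standing in for two consecutive API batches (hypothetical data)
BATCHES = {
    None: {
        'continue': {'plcontinue': '32413|0|SEAT', 'continue': '||'},
        'query': {'pages': {'32413': {'links': [
            {'ns': 0, 'title': 'BMW'}, {'ns': 0, 'title': 'Germany'}]}}},
    },
    '32413|0|SEAT': {
        'query': {'pages': {'32413': {'links': [{'ns': 0, 'title': 'SEAT'}]}}},
    },
}

def fake_fetch(token):
    """Return the canned batch for the given continuation token."""
    return BATCHES[token]

found = collect_links(fake_fetch, known_titles=set(['BMW', 'SEAT', 'Opel']))
```

In a real run, `fetch` would issue the request with the `plcontinue` value appended, as `get_c_interlinks` does. The loop form avoids deep recursion on pages with many link batches.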

In [102]:
# now get all links from each company page, to construct edges
# wikipedia API parameters
action = u'action=query'
# query maximum limit of links and follow redirects, e.g.: Imtech -> Royal Imtech N.V.
content = u'prop=links&pllimit=max&redirects'

### examples:
# https://en.wikipedia.org/w/api.php?action=query&titles=Volkswagen&prop=links&pllimit=max&redirects
# https://en.wikipedia.org/w/api.php?action=query&titles=Volkswagen&prop=links&pllimit=max&redirects&plcontinue=32413|0|Volkswagen_Golf_GTE

# Volkswagen example of link list
com_name = u'Volkswagen'
title = u'titles={0}'.format(urllib.quote_plus(
        com_name.encode('utf-8')))
# concatenate wikipedia query
url = u'{0}?{1}&{2}&{3}&{4}'.format(baseurl, action, title, dataformat, content)
print "Queried URL:", url
tmp_com_dat[com_name] = get_c_interlinks(url, title, tmp_com_dat[com_name])

print com_name, "excerpt:", tmp_com_dat[com_name]

Queried URL: https://en.wikipedia.org/w/api.php?action=query&titles=Volkswagen&format=json&prop=links&pllimit=max&redirects
Volkswagen excerpt: {'country': set([u'Germany']), 'industry': set(['unknown']), 'page_id': 76581, 'links': set([u'Brabus', u'Daimler AG', u'Ford of Europe', u'SEAT', u'Getrag', u'Rheinmetall', u'Scania AB', u'BP', u'Mercedes-Benz', u'Mahle GmbH', u'Volkswagen Group', u'Capital Group Companies', u'Bugatti Automobiles', u'Ducati', u'BMW', u'Mann+Hummel', u'Hella (company)', u'Magirus', u'Tesla Motors', u'Lamborghini', u'Italdesign Giugiaro', u'Renault', u'Sanyo', u'NSU Motorenwerke AG', u'Giorgetto Giugiaro', u'Gruppo Bertone', u'Krauss-Maffei', u'Voith', u'Robert Bosch GmbH', u'Alpina', u'Hanomag', u'Continental AG', u'Opel', u'D\xfcrr AG'])}


In [103]:
%%time
# get content key: company name, value: json
for com_name in tmp_com_dat.keys():
    # skip already parsed links like Volkswagen from above
    if 'links' in tmp_com_dat[com_name]:
        continue

    title = u'titles={0}'.format(urllib.quote_plus(
            com_name.encode('utf-8')))
    # concatenate wikipedia query
    url = u'{0}?{1}&{2}&{3}&{4}'.format(baseurl, action, title, dataformat, content)
    tmp_com_dat[com_name] = get_c_interlinks(url, title, tmp_com_dat[com_name])

Microsoft excerpt: {'country': set([u'the United States', u'Ireland']), 'industry': set(['unknown']), 'page_id': 654576, 'links': set([u'Toshiba', u'Norwegian Cruise Line', u'Adobe Systems', u'Amgen', u'Micromax Mobile', u'Huawei', u'Applied Materials', u'Comcast', u'Deloitte', u'Seagate Technology', u'NBCUniversal', u'Mylan', u'HP Inc.', u'TSMC', u'AOL', u'Xerox', u'American Express', u'Walmart', u'Booz Allen Hamilton', u'Alphabet Inc.', u'MediaTek', u'CA Technologies', u'Pfizer', u'Merck & Co.', u'Jabil Circuit', u'Silicon Power', u'Havok (company)', u'Konica Minolta', u'Apple Inc.', u'Leidos', u'Sony', u'Xilinx', u'KPMG', u'SanDisk', u'Bharti Airtel', u'Quanta Computer', u'Guardian Media Group', u'Fiserv', u'Visa Inc.', u'Orange S.A.', u'Am\xe9rica M\xf3vil', u'China Mobile', u'Dollar Tree', u'Wistron Corporation', u'DuPont', u'Pegatron', u'Accenture', u'The Travelers Companies', u'Infosys', u'Western Digital', u'Tech Mahindra', u'Paccar', u'STMicroelectronics', u'KT Corporation', u

In [104]:
# store company data in one binary file
with open('{0}/extraction3_data.pkl'.format(directory), 'wb') as f:
    pickle.dump(tmp_com_dat, f)
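
Since the `links` sets are meant to become the edges between companies, the stored dictionary can be turned into an edge list along the following lines. This is a sketch with sample data; treating the links as directed and dropping self-references are assumptions of this example, and `edges_from_links` is a hypothetical helper.

```python
def edges_from_links(data):
    """Return sorted (source, target) pairs for links between known companies."""
    edges = set()
    for source, record in data.items():
        for target in record.get('links', ()):
            # keep only links to other companies that are keys of the dictionary
            if target != source and target in data:
                edges.add((source, target))
    return sorted(edges)

# Sample data in the shape of the stored dictionary (other fields omitted)
sample = {
    'Volkswagen': {'links': set(['SEAT', 'BMW', 'Volkswagen'])},
    'SEAT': {'links': set(['Volkswagen'])},
    'BMW': {},
}
edges = edges_from_links(sample)
```

Companies without a `links` entry, like `BMW` in the sample, contribute no outgoing edges but can still be the target of an edge.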