## Web API based scraping

In this section, we will look at an alternative to manual, pattern-based HTML scraping. Some websites offer an API (Application Programming Interface) as a service, which provides a high-level interface for retrieving data directly from their back-end repositories or databases. APIs are typically exposed as URL endpoints: we add a query to the endpoint, send the request, and receive a payload in the response body.
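To make the idea of "modifying an endpoint with a query" concrete, here is a minimal sketch that builds a query URL with the standard library only. The endpoint is the MediaWiki one for English Wikipedia; the parameter names and values (`action`, `titles`, `format`) are assumptions chosen for illustration and should be checked against the API documentation. Nothing is sent over the network here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base endpoint of the MediaWiki API for English Wikipedia
ENDPOINT = 'https://en.wikipedia.org/w/api.php'

# Query parameters (assumed for illustration, check the API docs)
params = {
    'action': 'query',    # ask the API to look up pages
    'titles': 'Walmart',  # the article we are interested in
    'format': 'json',     # ask for a JSON payload in the response
}

# The query is appended to the endpoint after a '?'
url = ENDPOINT + '?' + urlencode(params)
print(url)

# Decoding the query string again recovers the parameters
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Sending this URL with any HTTP client would then return the payload; a wrapper library essentially builds and sends such requests for us.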

An API in the context of web scraping involves :
- Requests (sent through the Hypertext Transfer Protocol, HTTP)
- Headers

For example, Twitter's REST API allows developers to access core Twitter data, and the Search API provides methods for developers to interact with Twitter Search and trends data. Similarly, Wikipedia exposes the MediaWiki API at https://en.wikipedia.org/w/api.php

There are primarily two ways to use APIs :
- Through the command terminal using URL endpoints, or
- Through programming language specific *wrappers*

For example, `Tweepy` is a well-known Python wrapper for the Twitter API, whereas `twurl` is a command line interface (CLI) tool, but both can achieve the same outcomes.

Here we focus on the latter approach and will use a Python library (a wrapper) called `wptools` based around the MediaWiki API.

An API is typically defined as a set of specifications, such as the format of request messages, along with a definition of the structure of response messages, usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.
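As a small illustration of working with a JSON response, the sketch below parses a hand-written payload. The sample is illustrative only: it was not captured from a live request, and the nesting (`query` → `pages`) is an assumption about how such a response might be structured.

```python
import json

# Hand-written sample shaped like a JSON response body
# (illustrative only, not captured from a live request)
response_body = '''
{
  "query": {
    "pages": {
      "33589": {"pageid": 33589, "title": "Walmart"}
    }
  }
}
'''

# json.loads turns the text payload into nested Python dicts
payload = json.loads(response_body)

# Walk the assumed structure and collect the page titles
titles = [page['title'] for page in payload['query']['pages'].values()]
print(titles)
```

Once the payload is a plain dictionary, pulling out the fields we care about is ordinary Python.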

### Wikipedia API

Let's say we want to gather some additional data about the Fortune 500 companies. Since Wikipedia is a rich source of data, we decide to use the MediaWiki API to retrieve it. A good place to start is the **infoboxes** (as Wikipedia calls them) of the articles corresponding to each company on the list. An infobox contains metadata about the entity the article describes, which in our case is a company.

For example, consider the Wikipedia article for **Walmart** (https://en.wikipedia.org/wiki/Walmart), which includes the following infobox :

![An infobox](infobox.png)

As we can see from above, the infoboxes could provide us with a lot of valuable information such as :
- Year of founding 
- Industry
- Founder(s)
- Products	
- Services	
- Operating income
- Net income
- Total assets
- Total equity
- Number of employees, etc.

Although we expect this data to be fairly organized, it would require some post-processing which we will tackle in our next section. We pick a subset of our data and focus only on the top **20** of the Fortune 500 from the full list. 

Let's begin by installing some of the libraries we will use for this exercise and importing them as follows,

In [1]:
# sudo apt install libcurl4-openssl-dev libssl-dev
# pip install wptools
# pip install wikipedia
# pip install pandas
# pip install wordcloud

In [4]:
import json
import wptools
import itertools
import wikipedia
import pandas as pd
from pathlib import Path
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from IPython.display import Image

%matplotlib inline
plt.style.use('ggplot')

print(wptools.__version__)

0.4.17


Now let's load the data which we scraped in the previous section as follows,

In [8]:
fname = 'fortune_500_companies.csv'
path = Path('../data/')
df = pd.read_csv(path/fname)
df.head()

Unnamed: 0,rank,company_name,company_website
0,1,Walmart,http://www.stock.walmart.com
1,2,Exxon Mobil,http://www.exxonmobil.com
2,3,Berkshire Hathaway,http://www.berkshirehathaway.com
3,4,Apple,http://www.apple.com
4,5,UnitedHealth Group,http://www.unitedhealthgroup.com


And select only the top 20 companies from the list,

In [9]:
df_sub = df.iloc[:20, :].copy() # only the top 20 companies
companies = df_sub['company_name'].tolist() 

Let's have a brief look at them

In [7]:
companies

['Walmart',
 'Exxon Mobil',
 'Berkshire Hathaway',
 'Apple',
 'UnitedHealth Group',
 'McKesson',
 'CVS Health',
 'Amazon.com',
 'AT&T',
 'General Motors',
 'Ford Motor',
 'AmerisourceBergen',
 'Chevron',
 'Cardinal Health',
 'Costco',
 'Verizon',
 'Kroger',
 'General Electric',
 'Walgreens Boots Alliance',
 'JPMorgan Chase']

### Getting article names from wiki

A tricky issue with matching the top 20 Fortune 500 companies to Wikipedia article names is that the two sets of strings may differ slightly.

To get suggestions for each company name and its closest Wikipedia equivalents, we can use the `search` function of the `wikipedia` library (https://wikipedia.readthedocs.io/en/latest/code.html),

In [8]:
wiki_search = [{company : wikipedia.search(company)} for company in companies]

In [549]:
wiki_search

[{'Walmart': ['Walmart',
   'Criticism of Walmart',
   'Walmarting',
   'Walmart Canada',
   'History of Walmart',
   'List of Walmart brands',
   'Walmart Labs',
   'List of assets owned by Walmart',
   'Walmart de México y Centroamérica',
   'People of Walmart']},
 {'Exxon Mobil': ['ExxonMobil',
   'Exxon',
   'ExxonMobil climate change controversy',
   'Mobil',
   'Exxon Valdez oil spill',
   'Darren Woods',
   'Esso',
   'Exxon Valdez',
   'ExxonMobil Building',
   'List of public corporations by market capitalization']},
 {'Berkshire Hathaway': ['Berkshire Hathaway',
   'List of assets owned by Berkshire Hathaway',
   'Berkshire Hathaway Energy',
   'Berkshire Hathaway Assurance',
   'List of Berkshire Hathaway publications',
   'Berkshire Hathaway GUARD Insurance Companies',
   'Warren Buffett',
   'David L. Sokol',
   'Charlie Munger',
   'Clayton Homes']},
 {'Apple': ['Apple',
   'Apple Inc.',
   'Apple (disambiguation)',
   'Apple TV',
   'Apple Network Server',
   'IPhone',
 

In [550]:
most_probable = [(company, wiki_search[i][company][0]) for i, company in enumerate(companies)]
most_probable

[('Walmart', 'Walmart'),
 ('Exxon Mobil', 'ExxonMobil'),
 ('Berkshire Hathaway', 'Berkshire Hathaway'),
 ('Apple', 'Apple'),
 ('UnitedHealth Group', 'UnitedHealth Group'),
 ('McKesson', 'McKesson Corporation'),
 ('CVS Health', 'CVS Health'),
 ('Amazon.com', 'Amazon (company)'),
 ('AT&T', 'AT&T'),
 ('General Motors', 'General Motors'),
 ('Ford Motor', 'Ford Motor Company'),
 ('AmerisourceBergen', 'AmerisourceBergen'),
 ('Chevron', 'Chevron Corporation'),
 ('Cardinal Health', 'Cardinal Health'),
 ('Costco', 'Costco'),
 ('Verizon', 'Verizon Communications'),
 ('Kroger', 'Kroger'),
 ('General Electric', 'General Electric'),
 ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'),
 ('JPMorgan Chase', 'JPMorgan Chase')]
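Taking the first search hit works for most names, but not when the plain name is ambiguous. One possible heuristic, sketched below, is to flag any company whose search results include a "(disambiguation)" title so that it can be reviewed by hand. The `sample_search` dictionary is a trimmed copy of the results shown earlier, and `is_ambiguous` is a hypothetical helper, not part of the `wikipedia` library.

```python
# Trimmed copies of the search results shown earlier (first few titles only)
sample_search = {
    'Walmart': ['Walmart', 'Criticism of Walmart', 'Walmarting'],
    'Berkshire Hathaway': ['Berkshire Hathaway', 'Berkshire Hathaway Energy'],
    'Apple': ['Apple', 'Apple Inc.', 'Apple (disambiguation)'],
}

def is_ambiguous(candidates):
    """Heuristic: a '(disambiguation)' title among the hits suggests
    the plain name may refer to several different things."""
    return any(title.endswith('(disambiguation)') for title in candidates)

# Names that deserve a manual look before trusting the first hit
needs_review = [name for name, hits in sample_search.items()
                if is_ambiguous(hits)]
print(needs_review)  # prints ['Apple']
```

This is only a rough filter; a name can be wrong without any disambiguation page showing up, so a quick visual check of the matches is still worthwhile.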

In [551]:
companies = [x[1] for x in most_probable]
companies

['Walmart',
 'ExxonMobil',
 'Berkshire Hathaway',
 'Apple',
 'UnitedHealth Group',
 'McKesson Corporation',
 'CVS Health',
 'Amazon (company)',
 'AT&T',
 'General Motors',
 'Ford Motor Company',
 'AmerisourceBergen',
 'Chevron Corporation',
 'Cardinal Health',
 'Costco',
 'Verizon Communications',
 'Kroger',
 'General Electric',
 'Walgreens Boots Alliance',
 'JPMorgan Chase']

The first match for **Apple** is the article about the fruit rather than the company, so let's manually replace it with **Apple Inc.** as follows,

In [552]:
companies[companies.index('Apple')] = 'Apple Inc.'
print(companies)

['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'AmerisourceBergen', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon Communications', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase']
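If more names needed correcting, editing the list in place would become tedious. A minimal sketch of an alternative is to keep the manual corrections in a small override table and apply it while picking titles. The names `OVERRIDES` and `pick_title` are hypothetical, introduced only for this illustration, and the search results are trimmed copies of the ones shown earlier.

```python
# Hypothetical override table: Fortune 500 name -> preferred Wikipedia title
OVERRIDES = {'Apple': 'Apple Inc.'}

def pick_title(company, candidates, overrides=OVERRIDES):
    """Return the manual override if there is one, otherwise the first
    search hit, otherwise fall back to the original company name."""
    if company in overrides:
        return overrides[company]
    return candidates[0] if candidates else company

# Trimmed copies of the search results shown earlier
sample_search = {
    'Walmart': ['Walmart', 'Criticism of Walmart', 'Walmarting'],
    'Exxon Mobil': ['ExxonMobil', 'Exxon', 'Mobil'],
    'Apple': ['Apple', 'Apple Inc.', 'Apple (disambiguation)'],
}

titles = [pick_title(name, hits) for name, hits in sample_search.items()]
print(titles)  # prints ['Walmart', 'ExxonMobil', 'Apple Inc.']
```

Keeping the corrections in one place also documents which matches were adjusted by hand.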


> Note : Wiki data dump link (last updated 2015) : https://old.datahub.io/dataset/wikidata

## wptools

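We now use `wptools` to fetch the page data for a single article. The wptools wiki lists the data it captures :

- https://github.com/siznax/wptools/wiki/Data-captured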

In [13]:
page = wptools.page('Walmart')
page.get_parse()
page.get_wikidata()

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart store exterior 5266815680.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart s...
  infobox: <dict(30)> name, logo, logo_caption, image, image_size,...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(346504)> <root><template><title>about</title><pa...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(274081)> {{about|the retail chain|other uses}}{{p...
}
www.wikidata.org (wikidata) Q483551
www.wikidata.org (labels) Q180816|Q219635|P18|Q478758|Q10382887|Q...
www.wikidata.org (labels) P740|Q54862513|P966|P3500|Q6383259|Q694...
www.wikidata.org (labels) Q818364|P6160|P1278|P3347|Q17343056|P37...
en.wikipedia.org (imageinfo) File:Walmart Home Office.jpg
Walmart (en) data
{
  aliases: <list(5)> Wal-Mart, Wal Mart, Wal-Mart Stores

<wptools.page.WPToolsPage at 0x7f1fc48660f0>

In [15]:
page.data.keys()

dict_keys(['requests', 'iwlinks', 'pageid', 'wikitext', 'parsetree', 'infobox', 'title', 'wikibase', 'wikidata_url', 'image', 'labels', 'wikidata', 'wikidata_pageid', 'aliases', 'modified', 'description', 'label', 'claims', 'what'])

Alternatively, the data fetched from Wikidata is available under the `wikidata` key,

In [17]:
page.data['wikidata']

{'founded by (P112)': 'Sam Walton (Q497827)',
 'ISIN (P946)': 'US9311421039',
 'Commons category (P373)': 'Walmart',
 'instance of (P31)': ['retail chain (Q507619)', 'enterprise (Q6881511)'],
 'official website (P856)': 'https://www.walmart.com',
 "topic's main category (P910)": 'Category:Walmart (Q6383259)',
 'headquarters location (P159)': ['Bentonville (Q818364)', 'Arkansas (Q1612)'],
 'stock exchange (P414)': 'New York Stock Exchange (Q13677)',
 'subsidiary (P355)': ["Sam's Club (Q1972120)",
  'Massmart (Q3297791)',
  'Walmart Canada (Q1645718)',
  'Walmart Chile (Q5283104)',
  'Walmart de México y Centroamérica (Q1064887)',
  'Seiyu Group (Q3108542)',
  'Asda (Q297410)',
  'Walmart Labs (Q3816562)',
  'Walmart (Q30338489)',
  'Más Club (Q6949810)',
  'Líder (Q6711261)',
  'Hypermart USA (Q16845747)',
  'Amigo Supermarkets (Q4746234)',
  'Walmart Neighborhood Market (Q7963529)',
  'Asda Mobile (Q4804093)',
  'Marketside (Q6770960)',
  'Vudu (Q5371838)',
  'Walmart Nicaragua (Q22121
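The keys and many of the values in this dictionary carry a Wikidata identifier in parentheses, such as `(P112)` or `(Q497827)`. As a rough sketch, a regular expression can separate the readable label from the identifier. The sample entries are copied from the output above; `split_label` is a hypothetical helper, and list values (such as the subsidiaries) are not handled here.

```python
import re

# Matches 'some label (P123)' or 'some label (Q456)'
LABEL_ID = re.compile(r'^(?P<label>.*) \((?P<id>[PQ]\d+)\)$')

def split_label(text):
    """Split 'label (ID)' into (label, ID); return (text, None) when
    the text carries no trailing Wikidata identifier."""
    match = LABEL_ID.match(text)
    if match is None:
        return text, None
    return match.group('label'), match.group('id')

# A few string-valued entries copied from the output above
sample = {
    'founded by (P112)': 'Sam Walton (Q497827)',
    'ISIN (P946)': 'US9311421039',
    'stock exchange (P414)': 'New York Stock Exchange (Q13677)',
}

# Keep only the readable labels for both keys and values
cleaned = {split_label(key)[0]: split_label(value)[0]
           for key, value in sample.items()}
print(cleaned)
```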

In [378]:
wiki_data = []
# attributes of interest contained within the wiki infoboxes
features = ['founder', 'location_country', 'revenue', 'operating_income', 'net_income', 'assets',
        'equity', 'type', 'industry', 'products', 'num_employees']
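Not every infobox contains every attribute, so missing ones need a default. Before querying the live API, the selection logic can be tried on a hand-made dictionary, as in the sketch below. The sample infobox and the `select_features` helper are hypothetical and only illustrate the idea of defaulting missing attributes to an empty string.

```python
# A shortened feature list, just for this illustration
sample_features = ['founder', 'location_country', 'revenue', 'num_employees']

# Hand-made infobox with some of the attributes missing (illustrative only)
sample_infobox = {
    'name': 'Walmart',
    'founder': '[[Sam Walton]]',
    'location_country': 'U.S.',
}

def select_features(infobox, features):
    """Pick the requested features from an infobox dict, using ''
    for anything missing; a missing infobox (None) gives all ''."""
    infobox = infobox or {}
    return {feature: infobox.get(feature, '') for feature in features}

row = select_features(sample_infobox, sample_features)
empty_row = select_features(None, sample_features)
print(row)
print(empty_row)
```

Using the same set of keys for every company keeps the collected records uniform, which makes them easier to load into a table later.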

Now let's fetch results for all the companies as follows,

In [380]:
for company in companies:
    page = wptools.page(company)
    try:
        page.get_parse()  # fetch and parse the article (fills page.data)
        infobox = page.data['infobox']
        if infobox is not None:
            # keep only the features of interest; default to '' when missing
            data = {feature: infobox.get(feature, '') for feature in features}
        else:
            data = {feature: '' for feature in features}

        data['company_name'] = company
        wiki_data.append(data)

    except KeyError:
        # skip companies whose page data raised a KeyError
        # (e.g. no 'infobox' entry in page.data)
        pass

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart store exterior 5266815680.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart s...
  infobox: <dict(30)> name, logo, logo_caption, image, image_size,...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(346504)> <root><template><title>about</title><pa...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(274081)> {{about|the retail chain|other uses}}{{p...
}
en.wikipedia.org (parse) ExxonMobil
en.wikipedia.org (imageinfo) File:ExxonMobilBuilding.JPG
ExxonMobil (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:ExxonMobi...
  infobox: <dict(29)> name, logo, image, image_caption, type, trad...
  iwlinks: <list(3)> https://commons.wikimedia.org/wiki/Category:E...
  pageid: 18848197
  parsetree: <str(187433)> <root

en.wikipedia.org (imageinfo) File:Verizon Building (8156005279).jpg
Verizon Communications (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Verizon B...
  infobox: <dict(30)> name, logo, image, image_caption, former_nam...
  iwlinks: <list(3)> https://commons.wikimedia.org/wiki/Category:T...
  pageid: 18619278
  parsetree: <str(147152)> <root><template><title>short descriptio...
  requests: <list(2)> parse, imageinfo
  title: Verizon Communications
  wikibase: Q467752
  wikidata_url: https://www.wikidata.org/wiki/Q467752
  wikitext: <str(124812)> {{short description|American communicati...
}
en.wikipedia.org (parse) Kroger
en.wikipedia.org (imageinfo) File:Cincinnati-kroger-building.jpg
Kroger (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Cincinnat...
  infobox: <dict(24)> name, logo, image, image_caption, type, trad...
  iwlinks: <list(1)> https://commons.wikimedia.org/wiki/Category:Kroger
  pageid: 367762
  parsetree: <str(121519)> <root><te

In [381]:
wiki_data

[{'founder': '[[Sam Walton]]',
  'location_country': 'U.S.',
  'revenue': '{{increase}} {{US$|514.405 billion|link|=|yes}} (2019)',
  'operating_income': '{{increase}} {{US$|21.957 billion}} (2019)',
  'net_income': '{{decrease}} {{US$|6.67 billion}} (2019)',
  'assets': '{{increase}} {{US$|219.295 billion}} (2019)',
  'equity': '{{decrease}} {{US$|79.634 billion}} (2019)',
  'type': '[[Public company|Public]]',
  'industry': '[[Retail]]',
  'products': '{{hlist|Electronics|Movies and music|Home and furniture|Home improvement|Clothing|Footwear|Jewelry|Toys|Health and beauty|Pet supplies|Sporting goods and fitness|Auto|Photo finishing|Craft supplies|Party supplies|Grocery}}',
  'num_employees': '{{plainlist|\n* 2.2|nbsp|million, Worldwide (2018)|ref| name="xbrlus_1" |\n* 1.5|nbsp|million, U.S. (2017)|ref| name="Walmart"|{{cite web |url = http://corporate.walmart.com/our-story/locations/united-states |title = Walmart Locations Around the World – United States |publisher = |url-status=liv

Finally, let's export all the scraped infoboxes as a single JSON file to a convenient location as follows,

In [382]:
with open('../data/infoboxes.json', 'w') as file:
    json.dump(wiki_data, file)

To load the file back in later,

In [332]:
with open('../data/infoboxes.json', 'r') as file:
    wiki_data = json.load(file)
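A quick way to gain confidence in the export is a round trip: write the data, read it back, and compare. The sketch below does this in a temporary directory so that it does not touch `../data/`. The sample record is hand-made, and the choice of `ensure_ascii=False` with an explicit UTF-8 encoding is an assumption on our part, meant to keep non-ASCII names such as "México" readable in the file.

```python
import json
import tempfile
from pathlib import Path

# Hand-made sample record (illustrative only)
sample_data = [{'company_name': 'Walmart de México y Centroamérica',
                'founder': '[[Sam Walton]]',
                'location_country': 'U.S.'}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'infoboxes.json'

    # Write with an explicit encoding; keep non-ASCII characters as-is
    with open(out_path, 'w', encoding='utf-8') as file:
        json.dump(sample_data, file, ensure_ascii=False, indent=2)

    # Read the file back in
    with open(out_path, 'r', encoding='utf-8') as file:
        restored = json.load(file)

# The restored data should be identical to what we wrote
print(restored == sample_data)  # prints True
```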