## Web API based scraping

### A brief introduction to APIs

In this section, we will look at an alternative to the pattern-based HTML scraping of the previous section. Some websites offer an API (Application Programming Interface): a service that provides a high-level interface for retrieving data directly from their backend repositories or databases.

From Wikipedia:

> An API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

In practice, web APIs are typically URL endpoints to which we send requests. We modify each request according to our requirements (what we want in the response body), and the endpoint returns a payload (data) within the response, formatted as JSON, XML or HTML.

A popular web architecture style called REST (representational state transfer) allows users to interact with web services via HTTP methods, of which `GET` and `POST` are the two most commonly used.
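
To make the idea of "modifying a URL endpoint" concrete, here is a minimal sketch that builds a `GET` request URL for the MediaWiki API using only the standard library. The helper name `build_query_url` is hypothetical, and the chosen parameters (`action=query`, `prop=info`) are just one illustrative combination; nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Base endpoint of the MediaWiki API for English Wikipedia.
ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_query_url(title):
    """Return a GET URL asking for basic page info about `title` as JSON.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {
        "action": "query",   # which API module to call
        "prop": "info",      # which page properties to return
        "titles": title,     # the page we are interested in
        "format": "json",    # format of the response payload
    }
    # urlencode escapes the values and joins them into a query string
    return ENDPOINT + "?" + urlencode(params)

url = build_query_url("Berkshire Hathaway")
print(url)
```

Changing the values in `params` is what "modifying the endpoint based on our requirements" amounts to in practice.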

In the context of web scraping, working with an API mostly comes down to:
- Requests (sent through the Hypertext Transfer Protocol, HTTP)
- Headers (metadata sent along with each request and response)
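
As a rough illustration of these two pieces, the sketch below prepares (but does not send) an HTTP request with a couple of headers using the standard library. The `User-Agent` string is a made-up placeholder, not something the MediaWiki API prescribes.

```python
from urllib.request import Request

# Prepare a GET request; the header values are illustrative assumptions.
req = Request(
    "https://en.wikipedia.org/w/api.php?action=query&titles=Walmart&format=json",
    headers={
        "User-Agent": "fortune500-tutorial/0.1 (hypothetical contact: you@example.org)",
        "Accept": "application/json",
    },
    method="GET",
)

print(req.get_method())     # the HTTP method of the request
print(req.full_url)         # the URL endpoint being requested
print(req.header_items())   # the headers that would be sent along

# Sending it would be a single extra call, left out here on purpose:
# urllib.request.urlopen(req)
```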


For example, Twitter's REST API allows developers to access core Twitter data, and the Search API provides methods for developers to interact with Twitter Search and trends data. Similarly, the MediaWiki API that powers Wikipedia is available at https://en.wikipedia.org/w/api.php.

There are primarily two ways to use APIs :
- Through the command terminal using URL endpoints, or
- Through programming language specific *wrappers*

For example, `Tweepy` is a popular Python wrapper for the Twitter API, whereas `twurl` is a command line interface (CLI) tool; both can achieve the same outcomes.

Here we focus on the latter approach and use a Python library (a wrapper) called `wptools`, which is built around the MediaWiki API.

One advantage of using official APIs is that they usually comply with the terms of service (ToS) of the service that researchers want to gather data from. However, third-party libraries or packages that claim to provide more throughput than the official APIs (higher rate limits, more requests per second) generally operate in a gray area, as they tend to violate the ToS.

### Wikipedia API

Let's say we want to gather some additional data about the Fortune 500 companies. Since Wikipedia is a rich source of data, we decide to use the MediaWiki API to collect it. A very good place to start is the **infoboxes** (as Wikipedia calls them) of the articles corresponding to each company on the list. They contain a wealth of metadata about the entity the article describes, which in our case is a company.

For example, consider the Wikipedia article for **Walmart** (https://en.wikipedia.org/wiki/Walmart), which includes the following infobox:

![An infobox](infobox.png)

As we can see above, infoboxes can provide a lot of valuable information, such as:
- Year of founding 
- Industry
- Founder(s)
- Products	
- Services	
- Operating income
- Net income
- Total assets
- Total equity
- Number of employees, etc.

Although we expect this data to be fairly organized, it will require some post-processing, which we will tackle in the next section. For now, we pick a subset of our data and focus only on the top **20** companies of the Fortune 500 list.

Let's begin by installing some of the libraries we will use for this exercise as follows,

In [1]:
# sudo apt install libcurl4-openssl-dev libssl-dev
!pip install wptools
!pip install wikipedia
# pip install pandas
!pip install wordcloud

Next, we import them,

In [1]:
import json
import wptools
import itertools
import wikipedia
import pandas as pd
from pathlib import Path
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from IPython.display import Image

%matplotlib inline
plt.style.use('ggplot')              # setting the style to ggplot

print(wptools.__version__)           # checking the installed version

0.4.17


Now let's load the data which we scraped in the previous section as follows,

In [2]:
fname = 'fortune_500_companies.csv' # filename
path = Path('../data/')             # path to the csv file
df = pd.read_csv(path/fname)        # reading the csv file as a pandas df
df.head()                           # displaying the first 5 rows

Unnamed: 0,rank,company_name,company_website
0,1,Walmart,http://www.stock.walmart.com
1,2,Exxon Mobil,http://www.exxonmobil.com
2,3,Berkshire Hathaway,http://www.berkshirehathaway.com
3,4,Apple,http://www.apple.com
4,5,UnitedHealth Group,http://www.unitedhealthgroup.com


Let's select only the top 20 companies from the list as follows,

In [3]:
no_of_companies = 20                         # number of companies we are interested in
df_sub = df.iloc[:no_of_companies, :].copy() # only selecting the top 20 companies
companies = df_sub['company_name'].tolist()  # converting the column to a list

Now let's take a brief look at them,

In [29]:
for i, j in enumerate(companies):   # looping through the list of 20 companies
    print('{}. {}'.format(i+1, j))  # printing out the same

1. Walmart
2. Exxon Mobil
3. Berkshire Hathaway
4. Apple
5. UnitedHealth Group
6. McKesson
7. CVS Health
8. Amazon.com
9. AT&T
10. General Motors
11. Ford Motor
12. AmerisourceBergen
13. Chevron
14. Cardinal Health
15. Costco
16. Verizon
17. Kroger
18. General Electric
19. Walgreens Boots Alliance
20. JPMorgan Chase


### Getting article names from wiki

Right off the bat, as you might have guessed, a tricky issue with matching the top 20 Fortune 500 companies to their Wikipedia article names is that the two will not always be exactly the same, i.e. they may not match character for character.

To overcome this problem and ensure that every company name is paired with its corresponding Wikipedia article, we will use the `wikipedia` package (https://wikipedia.readthedocs.io/en/latest/code.html) to get search suggestions for each company name and find its equivalent on Wikipedia.

In [30]:
wiki_search = [{company : wikipedia.search(company)} for company in companies]

In [49]:
for idx, company in enumerate(wiki_search):
    for i, j in company.items():
        print('{}. {} :\n{}'.format(idx+1, i ,', '.join(j)))
        print('\n')

1. Walmart :
Walmart, Criticism of Walmart, History of Walmart, Walmarting, Walmart Canada, Walmart Labs, People of Walmart, List of Walmart brands, Walmart (disambiguation), Walmart Watch


2. Exxon Mobil :
ExxonMobil, Exxon, ExxonMobil climate change controversy, Mobil, ExxonMobil Building, 2020 Qatar ExxonMobil Open, Darren Woods, ExxonMobil Tower, Exxon Valdez oil spill, Exxon Valdez


3. Berkshire Hathaway :
Berkshire Hathaway, List of assets owned by Berkshire Hathaway, Berkshire Hathaway Energy, Berkshire Hathaway Assurance, Berkshire Hathaway GUARD Insurance Companies, List of Berkshire Hathaway publications, Warren Buffett, Ajit Jain, Berkshire Hathaway Travel Protection, The World's Billionaires


4. Apple :
Apple, Apple Inc., Apple (disambiguation), IPhone, Apple Music, Apple A13, Apple TV, Apple ID, Apple Watch, Apple Records


5. UnitedHealth Group :
UnitedHealth Group, Pharmacy benefit management, Optum, List of largest companies in the United States by revenue, William W

In [50]:
most_probable = [(company, wiki_search[i][company][0]) for i, company in enumerate(companies)]
most_probable

[('Walmart', 'Walmart'),
 ('Exxon Mobil', 'ExxonMobil'),
 ('Berkshire Hathaway', 'Berkshire Hathaway'),
 ('Apple', 'Apple'),
 ('UnitedHealth Group', 'UnitedHealth Group'),
 ('McKesson', 'McKesson Corporation'),
 ('CVS Health', 'CVS Health'),
 ('Amazon.com', 'Amazon (company)'),
 ('AT&T', 'AT&T'),
 ('General Motors', 'General Motors'),
 ('Ford Motor', 'Ford Motor Company'),
 ('AmerisourceBergen', 'AmerisourceBergen'),
 ('Chevron', 'Chevron Corporation'),
 ('Cardinal Health', 'Cardinal Health'),
 ('Costco', 'Costco'),
 ('Verizon', 'Verizon Communications'),
 ('Kroger', 'Kroger'),
 ('General Electric', 'General Electric'),
 ('Walgreens Boots Alliance', 'Walgreens Boots Alliance'),
 ('JPMorgan Chase', 'JPMorgan Chase')]

In [551]:
companies = [x[1] for x in most_probable]
companies

['Walmart',
 'ExxonMobil',
 'Berkshire Hathaway',
 'Apple',
 'UnitedHealth Group',
 'McKesson Corporation',
 'CVS Health',
 'Amazon (company)',
 'AT&T',
 'General Motors',
 'Ford Motor Company',
 'AmerisourceBergen',
 'Chevron Corporation',
 'Cardinal Health',
 'Costco',
 'Verizon Communications',
 'Kroger',
 'General Electric',
 'Walgreens Boots Alliance',
 'JPMorgan Chase']

For **Apple**, the top search result is the article about the fruit rather than the company, so let's manually replace it with **Apple Inc.** as follows,

In [552]:
companies[companies.index('Apple')] = 'Apple Inc.'
print(companies)

['Walmart', 'ExxonMobil', 'Berkshire Hathaway', 'Apple Inc.', 'UnitedHealth Group', 'McKesson Corporation', 'CVS Health', 'Amazon (company)', 'AT&T', 'General Motors', 'Ford Motor Company', 'AmerisourceBergen', 'Chevron Corporation', 'Cardinal Health', 'Costco', 'Verizon Communications', 'Kroger', 'General Electric', 'Walgreens Boots Alliance', 'JPMorgan Chase']
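
The Apple case shows that blindly taking the first search result can pick the wrong article. One possible heuristic (an assumption of this sketch, not part of the `wikipedia` package) is to flag a company for manual review whenever its suggestions contain a "(disambiguation)" entry. The sample below uses abridged suggestion lists copied from the search output above.

```python
# Abridged suggestion lists taken from the search results shown earlier.
suggestions = {
    "Walmart": ["Walmart", "Criticism of Walmart", "Walmart (disambiguation)"],
    "Exxon Mobil": ["ExxonMobil", "Exxon", "Mobil"],
    "Apple": ["Apple", "Apple Inc.", "Apple (disambiguation)"],
}

def needs_review(results):
    """Return True when any suggestion is a disambiguation page.

    Hypothetical helper: a disambiguation entry hints that the name
    is shared by several articles, so the first hit may be wrong.
    """
    return any(title.endswith("(disambiguation)") for title in results)

flagged = [name for name, results in suggestions.items() if needs_review(results)]
print(flagged)   # → ['Walmart', 'Apple']
```

The heuristic is deliberately conservative: Walmart is flagged even though its first result happens to be correct, so the flagged names still need a human look.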


> Note : Wiki data dump link (last updated 2015) : https://old.datahub.io/dataset/wikidata

## wptools

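`wptools` captures several kinds of data for each page; see https://github.com/siznax/wptools/wiki/Data-captured for the full list. Let's fetch the parsed article and the Wikidata entry for Walmart as follows,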

In [13]:
page = wptools.page('Walmart')
page.get_parse()
page.get_wikidata()

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart store exterior 5266815680.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart s...
  infobox: <dict(30)> name, logo, logo_caption, image, image_size,...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(346504)> <root><template><title>about</title><pa...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(274081)> {{about|the retail chain|other uses}}{{p...
}
www.wikidata.org (wikidata) Q483551
www.wikidata.org (labels) Q180816|Q219635|P18|Q478758|Q10382887|Q...
www.wikidata.org (labels) P740|Q54862513|P966|P3500|Q6383259|Q694...
www.wikidata.org (labels) Q818364|P6160|P1278|P3347|Q17343056|P37...
en.wikipedia.org (imageinfo) File:Walmart Home Office.jpg
Walmart (en) data
{
  aliases: <list(5)> Wal-Mart, Wal Mart, Wal-Mart Stores

<wptools.page.WPToolsPage at 0x7f1fc48660f0>

In [15]:
page.data.keys()

dict_keys(['requests', 'iwlinks', 'pageid', 'wikitext', 'parsetree', 'infobox', 'title', 'wikibase', 'wikidata_url', 'image', 'labels', 'wikidata', 'wikidata_pageid', 'aliases', 'modified', 'description', 'label', 'claims', 'what'])

We can also look at the Wikidata claims directly,

In [17]:
page.data['wikidata']

{'founded by (P112)': 'Sam Walton (Q497827)',
 'ISIN (P946)': 'US9311421039',
 'Commons category (P373)': 'Walmart',
 'instance of (P31)': ['retail chain (Q507619)', 'enterprise (Q6881511)'],
 'official website (P856)': 'https://www.walmart.com',
 "topic's main category (P910)": 'Category:Walmart (Q6383259)',
 'headquarters location (P159)': ['Bentonville (Q818364)', 'Arkansas (Q1612)'],
 'stock exchange (P414)': 'New York Stock Exchange (Q13677)',
 'subsidiary (P355)': ["Sam's Club (Q1972120)",
  'Massmart (Q3297791)',
  'Walmart Canada (Q1645718)',
  'Walmart Chile (Q5283104)',
  'Walmart de México y Centroamérica (Q1064887)',
  'Seiyu Group (Q3108542)',
  'Asda (Q297410)',
  'Walmart Labs (Q3816562)',
  'Walmart (Q30338489)',
  'Más Club (Q6949810)',
  'Líder (Q6711261)',
  'Hypermart USA (Q16845747)',
  'Amigo Supermarkets (Q4746234)',
  'Walmart Neighborhood Market (Q7963529)',
  'Asda Mobile (Q4804093)',
  'Marketside (Q6770960)',
  'Vudu (Q5371838)',
  'Walmart Nicaragua (Q22121
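
As the output above shows, keys and many values follow a `label (ID)` pattern, where the ID is a Wikidata property (`P...`) or item (`Q...`). A minimal sketch for splitting the two apart is given below; the helper name `split_label` and the regular expression are assumptions made for illustration, and the sample entries are copied from the output above.

```python
import re

# A label, a space, then a Wikidata property/item ID in parentheses.
KEY_PATTERN = re.compile(r"^(?P<label>.+) \((?P<wid>[PQ]\d+)\)$")

def split_label(text):
    """Split 'label (P123)' into ('label', 'P123').

    Strings that do not follow the pattern are returned with None as ID.
    """
    match = KEY_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group("label"), match.group("wid")

sample = {
    "founded by (P112)": "Sam Walton (Q497827)",
    "official website (P856)": "https://www.walmart.com",
}

for key, value in sample.items():
    print(split_label(key), split_label(value))
```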

In [378]:
wiki_data = []
# attributes of interest contained within the wiki infoboxes
features = ['founder', 'location_country', 'revenue', 'operating_income', 'net_income', 'assets',
        'equity', 'type', 'industry', 'products', 'num_employees']

Now let's fetch the infobox for each company as follows,

In [380]:
for company in companies:
    page = wptools.page(company)
    try:
        page.get_parse()                       # fetch and parse the article
        infobox = page.data['infobox'] or {}   # the infobox is None when the article has none
        # keep only the features of interest, defaulting to '' when missing
        data = {feature: infobox.get(feature, '') for feature in features}
        data['company_name'] = company
        wiki_data.append(data)
    except KeyError:                           # no 'infobox' key was returned; skip this company
        pass

en.wikipedia.org (parse) Walmart
en.wikipedia.org (imageinfo) File:Walmart store exterior 5266815680.jpg
Walmart (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Walmart s...
  infobox: <dict(30)> name, logo, logo_caption, image, image_size,...
  iwlinks: <list(2)> https://commons.wikimedia.org/wiki/Category:W...
  pageid: 33589
  parsetree: <str(346504)> <root><template><title>about</title><pa...
  requests: <list(2)> parse, imageinfo
  title: Walmart
  wikibase: Q483551
  wikidata_url: https://www.wikidata.org/wiki/Q483551
  wikitext: <str(274081)> {{about|the retail chain|other uses}}{{p...
}
en.wikipedia.org (parse) ExxonMobil
en.wikipedia.org (imageinfo) File:ExxonMobilBuilding.JPG
ExxonMobil (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:ExxonMobi...
  infobox: <dict(29)> name, logo, image, image_caption, type, trad...
  iwlinks: <list(3)> https://commons.wikimedia.org/wiki/Category:E...
  pageid: 18848197
  parsetree: <str(187433)> <root

en.wikipedia.org (imageinfo) File:Verizon Building (8156005279).jpg
Verizon Communications (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Verizon B...
  infobox: <dict(30)> name, logo, image, image_caption, former_nam...
  iwlinks: <list(3)> https://commons.wikimedia.org/wiki/Category:T...
  pageid: 18619278
  parsetree: <str(147152)> <root><template><title>short descriptio...
  requests: <list(2)> parse, imageinfo
  title: Verizon Communications
  wikibase: Q467752
  wikidata_url: https://www.wikidata.org/wiki/Q467752
  wikitext: <str(124812)> {{short description|American communicati...
}
en.wikipedia.org (parse) Kroger
en.wikipedia.org (imageinfo) File:Cincinnati-kroger-building.jpg
Kroger (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': 'File:Cincinnat...
  infobox: <dict(24)> name, logo, image, image_caption, type, trad...
  iwlinks: <list(1)> https://commons.wikimedia.org/wiki/Category:Kroger
  pageid: 367762
  parsetree: <str(121519)> <root><te
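
The selection logic used in the loop above can be isolated and tried out without any network access. The sketch below uses a hypothetical helper, `extract_features`, and a tiny hand-written infobox, to show how missing features and missing infoboxes both end up as empty strings.

```python
def extract_features(infobox, features, default=''):
    """Pick `features` out of an infobox dict, filling gaps with `default`.

    Hypothetical helper; `infobox` may be None when a page has no infobox.
    """
    if infobox is None:
        return {feature: default for feature in features}
    return {feature: infobox.get(feature, default) for feature in features}

demo_features = ['founder', 'industry', 'num_employees']
sample_infobox = {'founder': '[[Sam Walton]]', 'industry': '[[Retail]]'}

print(extract_features(sample_infobox, demo_features))
print(extract_features(None, demo_features))
```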

In [381]:
wiki_data

[{'founder': '[[Sam Walton]]',
  'location_country': 'U.S.',
  'revenue': '{{increase}} {{US$|514.405 billion|link|=|yes}} (2019)',
  'operating_income': '{{increase}} {{US$|21.957 billion}} (2019)',
  'net_income': '{{decrease}} {{US$|6.67 billion}} (2019)',
  'assets': '{{increase}} {{US$|219.295 billion}} (2019)',
  'equity': '{{decrease}} {{US$|79.634 billion}} (2019)',
  'type': '[[Public company|Public]]',
  'industry': '[[Retail]]',
  'products': '{{hlist|Electronics|Movies and music|Home and furniture|Home improvement|Clothing|Footwear|Jewelry|Toys|Health and beauty|Pet supplies|Sporting goods and fitness|Auto|Photo finishing|Craft supplies|Party supplies|Grocery}}',
  'num_employees': '{{plainlist|\n* 2.2|nbsp|million, Worldwide (2018)|ref| name="xbrlus_1" |\n* 1.5|nbsp|million, U.S. (2017)|ref| name="Walmart"|{{cite web |url = http://corporate.walmart.com/our-story/locations/united-states |title = Walmart Locations Around the World – United States |publisher = |url-status=liv
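
The values above are still raw wikitext, e.g. `[[Public company|Public]]`. As a small preview of the kind of post-processing this calls for, here is a sketch that unwraps wiki links with a regular expression. The function name and the pattern are assumptions for illustration; templates such as `{{US$|...}}` are left untouched.

```python
import re

# [[Target]] or [[Target|Shown text]]; group 1 is the text a reader would see.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

def strip_wikilinks(text):
    """Replace every wiki link in `text` with its visible label."""
    return WIKILINK.sub(r"\1", text)

print(strip_wikilinks('[[Sam Walton]]'))              # → Sam Walton
print(strip_wikilinks('[[Public company|Public]]'))   # → Public
```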

Finally, let's export all the scraped infoboxes as a single JSON file to a convenient location as follows,

In [382]:
with open('../data/infoboxes.json', 'w') as file:
    json.dump(wiki_data, file)

To load the file back later,

In [332]:
with open('../data/infoboxes.json', 'r') as file:
    wiki_data = json.load(file)
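
The export and import steps can be checked with a quick round trip. The sketch below writes two made-up records to a temporary file (the file name is hypothetical) and reads them back; passing `ensure_ascii=False` together with an explicit UTF-8 encoding keeps accented characters readable in the file, which is an optional choice rather than a requirement.

```python
import json
import tempfile
from pathlib import Path

# Two small sample records in the same shape as the scraped data.
records = [
    {"company_name": "Walmart", "founder": "[[Sam Walton]]"},
    {"company_name": "Walmart de México y Centroamérica", "founder": ""},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "infoboxes_demo.json"   # hypothetical file name
    # export: write the list of dicts as JSON
    with open(out_path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=2)
    # import: read the JSON back into Python objects
    with open(out_path, "r", encoding="utf-8") as file:
        restored = json.load(file)

print(restored == records)   # → True
```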

### References

- https://phpenthusiast.com/blog/what-is-rest-api