<img src="https://github.com/MonashDataFluency/python-web-scraping/blob/master/images/api.png?raw=1">

### A brief introduction to APIs
---

Sometimes websites offer an API (Application Programming Interface) as a service, which provides a high-level interface for retrieving data directly from their back-end repositories or databases.

From Wikipedia,

> "*An API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.*"

APIs are typically exposed as URL endpoints to which we send requests. We modify the request according to our requirements (what we want in the response body), and the endpoint returns a payload (data) within the response, formatted as JSON, XML or HTML.
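
As a minimal sketch of what "modifying an endpoint" can look like, the snippet below builds (but does not send) a request URL for the MediaWiki API of English Wikipedia. The helper name `build_query_url` and the particular choice of parameters are illustrative assumptions, not something taken from this tutorial.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint for English Wikipedia
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Return a URL that asks the API for basic page info, formatted as JSON."""
    params = {
        "action": "query",  # which API module to call
        "titles": title,    # the page we are interested in
        "prop": "info",     # what we want to know about that page
        "format": "json",   # format of the response payload
    }
    # The parameters are appended to the endpoint as a query string
    return API_ENDPOINT + "?" + urlencode(params)


url = build_query_url("Walmart")
print(url)
```

Changing the values in `params` changes what comes back in the response body; the endpoint itself stays the same.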

A popular web architecture style called `REST` (representational state transfer) allows users to interact with web services via HTTP calls such as `GET` and `POST` (the two most commonly used), which we briefly saw in the previous section.
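
To make the difference between the two calls concrete, here is a small sketch using only the standard library. The endpoint `https://api.example.com/items` and the parameter names are hypothetical, and nothing is actually sent over the network; the requests are only constructed and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used purely for illustration
ENDPOINT = "https://api.example.com/items"

# GET: the parameters travel in the URL's query string
get_req = Request(ENDPOINT + "?" + urlencode({"q": "walmart"}), method="GET")

# POST: the parameters travel in the request body, encoded as bytes
post_req = Request(
    ENDPOINT,
    data=urlencode({"name": "walmart"}).encode("utf-8"),
    method="POST",
)

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```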

For example, Twitter's REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

There are primarily two ways to use APIs:

- Through the command terminal using URL endpoints, or
- Through programming language specific *wrappers*

For example, `Tweepy` is a popular Python wrapper for the Twitter API, whereas `twurl` is a command line interface (CLI) tool; both can achieve the same outcomes.

Here we focus on the latter approach and use a Python library (a wrapper) called `wptools`, which is built around the MediaWiki API.

One advantage of using official APIs is that they are usually compliant with the terms of service (ToS) of the particular service that researchers want to gather data from. However, third-party libraries or packages that claim to provide more throughput than the official APIs (higher rate limits, more requests per second) generally operate in a gray area, as they tend to violate the ToS. Always be sure to read their documentation thoroughly.
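
One simple way to stay within a rate limit is to pause between consecutive requests. The class below is a rough sketch of that idea; the name `Throttle` and the 0.2 second interval are assumptions chosen for illustration, and the real limit should come from the documentation of the service being used.

```python
import time


class Throttle:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the interval; return the seconds slept."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay


throttle = Throttle(min_interval=0.2)  # arbitrary illustrative value
for page_title in ["Walmart", "Citizen_Kane"]:
    throttle.wait()
    print("would request", page_title)  # placeholder for the actual API call
```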

### Wikipedia API

Let's say we want to gather some additional data about the Fortune 500 companies. Since Wikipedia is a rich source of data, we decide to use the MediaWiki API to scrape it. A very good place to start is the **infoboxes** (as Wikipedia calls them) of the articles corresponding to each company on the list. They contain a wealth of metadata about the entity the article describes, which in our case is a company.

For example, consider the Wikipedia article for **Walmart** (https://en.wikipedia.org/wiki/Walmart), which includes the following infobox:

![An infobox](https://github.com/MonashDataFluency/python-web-scraping/blob/master/images/infobox.png?raw=1)

As we can see above, infoboxes can provide us with a lot of valuable information, such as:

- Year of founding 
- Industry
- Founder(s)
- Products	
- Services	
- Operating income
- Net income
- Total assets
- Total equity
- Number of employees, etc.

Although we expect this data to be fairly organized, it will require some post-processing, which we tackle in the next section. We pick a subset of our data and focus only on the top **20** companies of the Fortune 500.
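
As a rough sketch of how an infobox might be retrieved and trimmed down to the fields of interest: `fetch_infobox` uses the `wptools` calls `get_parse()` and `page.data['infobox']` and needs network access, while `pick_fields` is a hypothetical helper. The field names and the sample dictionary are made-up placeholders so that the example runs offline; real infobox keys vary from article to article.

```python
def fetch_infobox(title):
    """Fetch the infobox of a Wikipedia article as a dict (needs network access)."""
    import wptools  # imported lazily so the helper below also works without it
    page = wptools.page(title, silent=True).get_parse()
    return page.data.get('infobox') or {}


def pick_fields(infobox, wanted):
    """Keep only the requested keys that are actually present in the infobox."""
    return {key: infobox[key] for key in wanted if key in infobox}


# Made-up stand-in for a real infobox, so the example runs offline
sample_infobox = {
    'industry': '<placeholder>',
    'founder': '<placeholder>',
    'logo': '<placeholder>',
}

# Assumed field names; 'num_employees' is missing from the sample and is skipped
print(pick_fields(sample_infobox, ['industry', 'founder', 'num_employees']))
```

With network access and `wptools` installed, `pick_fields(fetch_infobox('Walmart'), [...])` would apply the same filtering to a live infobox.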

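Let's begin by importing the libraries we will use for this exercise. If `wptools` is not installed, the import fails with a `ModuleNotFoundError` (as in the output below); installing it first, for example with `pip install wptools`, resolves this.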

In [2]:
import pandas as pd
pd.set_option('display.max_rows', 40)
import numpy as np
import matplotlib
import imageio as io  # note: in this notebook the name `io` refers to imageio, not the standard-library io module
import matplotlib.pyplot as plt
import wptools
import requests
from PIL import Image
from io import BytesIO
from glob import glob
import os
import shutil
import time
import json
print('imported!')

ModuleNotFoundError: No module named 'wptools'

*Reading and writing JSON files in Python: [Stack Abuse](http://stackabuse.com/reading-and-writing-json-to-a-file-in-python/)*
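
For reference, a minimal round trip with the standard `json` module could look like the following. The dictionary contents and the file name `record.json` are made up for illustration, and a temporary directory is used so that nothing is left behind in the working folder.

```python
import json
import os
import tempfile

# Made-up data: any combination of dicts, lists, strings, numbers and booleans works
record = {'name': 'example', 'values': [1, 2, 3], 'nested': {'ok': True}}

path = os.path.join(tempfile.mkdtemp(), 'record.json')

# Write the dictionary to disk as JSON
with open(path, 'w') as fp:
    json.dump(record, fp, indent=2)

# Read it back into a Python object
with open(path) as fp:
    loaded = json.load(fp)

print(loaded == record)
```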

In [None]:
# First create the bestofrt_posters folder if it doesn't exist

folder_name = './datasets/bestofrt_posters'

if not os.path.exists(folder_name):
    # makedirs also creates the parent 'datasets' folder when it is missing
    os.makedirs(folder_name)
    print('Folder created!')
else:
    print('Folder exists')

In [None]:
os.listdir('datasets')

In [None]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 "A_Hard_Day's_Night",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [None]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []

In [None]:
def get_poster_url(title):
    page = wptools.page(title, silent=True).get()
    imgs = page.data['image']
    url = imgs[0]['url']
    return imgs, url
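
`get_poster_url` assumes that the page has at least one image and that the first entry carries a `'url'` key. A slightly more defensive variant, sketched here on a plain dictionary shaped like `page.data`, could return `None` instead of raising. The function name `first_image_url_or_none` and the sample data are hypothetical.

```python
def first_image_url_or_none(page_data):
    """Return the URL of the first image entry that has one, or None."""
    for image in page_data.get('image') or []:
        url = image.get('url')
        if url:
            return url
    return None


# Made-up dictionaries shaped like `page.data` from the function above
with_image = {'image': [{'kind': 'example'}, {'url': 'https://example.org/poster.jpg'}]}
without_image = {'title': 'No images here'}

print(first_image_url_or_none(with_image))
print(first_image_url_or_none(without_image))
```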

In [None]:
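def download_imgs(first_image_url):
    # Download the movie poster over HTTPS using imageio (imported above as `io`).
    # Note: relies on the global variables `ranking`, `title` and `folder_name`.
    frames = io.imread(first_image_url)
    # Use the text after the last '.' in the URL as the file extension
    image_file_format = first_image_url.split('.')[-1]
    # Save the image as '<ranking>_<title>.<extension>'
    img_name = folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format
    matplotlib.image.imsave(img_name, frames)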

In [None]:
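def download_img(first_image_url):
    # Alternative downloader: fetch the poster with requests and decode it with PIL.
    # Note: relies on the global variables `ranking`, `title` and `folder_name`.
    r = requests.get(first_image_url)
    r.raise_for_status()  # fail early on HTTP errors instead of decoding an error page
    i = Image.open(BytesIO(r.content))
    image_file_format = first_image_url.split('.')[-1]
    i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)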
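
Both download functions take the file extension from whatever follows the last `.` in the URL and use the raw title in the file name. That can misbehave when the URL carries a query string or the title contains characters such as `:` or `%27`. The helper below is one possible way to build a safer path; the name `build_image_path`, the `.jpg` fallback and the set of allowed characters are assumptions made for this sketch.

```python
import os
from urllib.parse import unquote, urlsplit


def build_image_path(folder, ranking, title, image_url):
    """Return a local path of the form '<folder>/<ranking>_<title><ext>'."""
    # Take the extension from the URL path only, ignoring '?query' and '#fragment';
    # fall back to '.jpg' (an arbitrary assumption) when the URL has no extension
    ext = os.path.splitext(urlsplit(image_url).path)[1].lower() or '.jpg'
    # Decode escapes such as %27, then replace characters that are risky in file names
    safe_title = ''.join(
        c if c.isalnum() or c in '._-()' else '_' for c in unquote(title)
    )
    return os.path.join(folder, f'{ranking}_{safe_title}{ext}')


print(build_image_path('posters', 5, 'Mad_Max:_Fury_Road',
                       'https://example.org/img/poster.JPG?width=300'))
```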

In [None]:
# Define a dictionary for holding images that had issues downloading

image_errors = {}

In [None]:
if False:  # the download loop is disabled; change to True to run it
    for ranking, title in enumerate(title_list, start=1):
        # Reset so the except block never sees an undefined or stale value
        images = None
        try:
            # This cell is slow so print ranking to gauge time remaining
            print(ranking)

            # First image is usually the poster
            images, first_image_url = get_poster_url(title)

            # Download images (download_imgs reads the global `ranking` and `title`)
            download_imgs(first_image_url)

            # Append to list of dictionaries
            df_list.append({'ranking': int(ranking),
                            'title': title,
                            'poster_url': first_image_url})

        # Not best practice to catch all exceptions but fine for this short script
        except Exception as e:
            print(str(ranking) + "_" + title + ": " + str(e))
            image_errors[str(ranking) + "_" + title] = images

In [None]:
for img in image_errors.keys():
    print(img)

In [None]:
# save the error dict as a json object

with open('img_errors.json', 'w') as fp:
    json.dump(image_errors, fp)

In [None]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

In [None]:
df.to_csv('ebert_imgs.csv', index=False)