 Data Collection

### DEMO 1: Data Collection using APIs

### Objectives 
- Learn about APIs
- Learn JSON format
- Use the Google Books API

### What is an API?

- HTML, CSS, and JavaScript create human-readable webpages.
- Applications often need to consume the data on these webpages.
- You can scrape webpages, but it is a tedious task!
- APIs instead provide machine-readable data for such software, mainly in the form of **JSON responses** (we will learn more about JSON shortly).
- Companies such as Google, Twitter, Facebook, and Tumblr give software developers access to their data via their APIs.

API stands for *Application Programming Interface*. Many websites provide their data through an API so that they can control how and when the data is used.

### How does an API work?

Simply put, you send **requests** to a remote server, and the remote server **responds** by sending the data back to you.

So what we need to do is:

1) Establish our connection to the server 

2) Send requests

3) Receive responses 

4) Manipulate responses


### The HTTP request-response cycle

APIs use HTTP (Hypertext Transfer Protocol), which is based on a request-response cycle.

<img src="images/apis.png" width=60%>
From: https://zapier.com/learn/apis/chapter-2-protocols/

### Request structure:

<img src="images/request.png" width=60%>

### Response structure:

<img src="images/response.png" width=60%>

Complete list of HTTP status codes: https://www.restapitutorial.com/httpstatuscodes.html
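
As a small illustration of the status code that every response carries, the sketch below uses the `HTTPStatus` enum from Python's standard library to look up the meaning of a few common codes. The helper name `describe_status` is made up for this example and is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK (success)' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {status.phrase} ({family})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

The first digit of the code tells you the family, so a program can usually decide what to do next (use the data, retry, or report an error) from that digit alone.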

## Part 1: Using Requests Library

Requests is a simple HTTP library for Python. It provides methods for accessing web resources via HTTP and lets you send HTTP/1.1 requests with very little code. 
It is available within Anaconda. 

In [1]:
import requests

In [2]:
response = requests.get('https://google.com')

In [3]:
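print(response.status_code)  # numeric HTTP status, e.g. 200 for success
print(response.headers)      # response headers (dictionary-like)
print(response.text)         # response body decoded as text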

200
{'Date': 'Sun, 18 Aug 2019 13:05:14 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2019-08-18-13; expires=Tue, 17-Sep-2019 13:05:14 GMT; path=/; domain=.google.com, NID=188=mQhi7ZWkjyHKaMwJjtxJsAlCDwanBRhNV6m51ZMjScBelGYbE-xYyDLtXvS1hRcbI8MuNbONjXmoRMqYs82NBKe3SAnYvKxwoDfRBa3yEPotVv_LNxO9NzFQtDXc_4ks93oAXd8kP0twTLK3QCkyUrF9NBkTEIUz6BfYRHR2_e8; expires=Mon, 17-Feb-2020 13:05:14 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'quic=":443"; ma=2592000; v="46,43,39"', 'Transfer-Encoding': 'chunked'}
<!doctype html><html dir="rtl" itemscope="" itemtype="http://schema.org/WebPage" lang="ar-SA"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard

In [4]:
response.headers

{'Date': 'Sun, 18 Aug 2019 13:05:14 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2019-08-18-13; expires=Tue, 17-Sep-2019 13:05:14 GMT; path=/; domain=.google.com, NID=188=mQhi7ZWkjyHKaMwJjtxJsAlCDwanBRhNV6m51ZMjScBelGYbE-xYyDLtXvS1hRcbI8MuNbONjXmoRMqYs82NBKe3SAnYvKxwoDfRBa3yEPotVv_LNxO9NzFQtDXc_4ks93oAXd8kP0twTLK3QCkyUrF9NBkTEIUz6BfYRHR2_e8; expires=Mon, 17-Feb-2020 13:05:14 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'quic=":443"; ma=2592000; v="46,43,39"', 'Transfer-Encoding': 'chunked'}

### Let's try something real!

We will use the Google Books API:

https://developers.google.com/books/

You can find the reference and other examples here: https://developers.google.com/books/docs/v1/reference/

Copy and paste the following URL into a browser:
    
    https://www.googleapis.com/books/v1/volumes?q=isbn:1860462979

What can you see?

The Google Books API lets you request information about books using their ISBN.

When you open the URL in a browser, the browser sends an HTTP request to get the information from the API. 
The response is what you saw in the browser. 
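
To make the request more concrete, the following sketch takes the URL above apart with Python's standard `urllib.parse` module and then rebuilds it from its pieces. The variable names are illustrative; only the URL itself comes from this demo.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://www.googleapis.com/books/v1/volumes?q=isbn:1860462979"

parts = urlsplit(url)
print(parts.scheme)   # protocol
print(parts.netloc)   # server that receives the request
print(parts.path)     # resource on that server
print(parts.query)    # query string with the search parameters

# The query string is a set of name=value parameters; here there is only 'q'
params = parse_qs(parts.query)
print(params)

# Rebuild the same URL from a base address and a dictionary of parameters.
# safe=":" keeps the colon in 'isbn:...' readable instead of encoding it as %3A.
base = f"{parts.scheme}://{parts.netloc}{parts.path}"
rebuilt = base + "?" + urlencode({"q": "isbn:1860462979"}, safe=":")
print(rebuilt == url)
```

With Requests, you can pass such a dictionary as the `params=` argument of `requests.get` and let the library build the query string for you.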

#### But what is that format?

Yes!
That is a JSON response!

Let us investigate it!

JSON (JavaScript Object Notation) is a lightweight data-interchange format. 
It is easy for humans to read and write, and easy for machines to parse and generate.

More info at: http://json.org

Tutorial at: http://www.w3schools.com/json/ 

#### JSON Syntax:

<img src="images/json.png" width=60%>
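
To make the syntax concrete, here is a minimal sketch that parses a small JSON text with Python's standard `json` module. The sample text is hand-written for illustration (it is not a real API response), but it contains the three building blocks: name/value pairs, objects, and arrays.

```python
import json

# Hand-written sample for illustration, not an actual API response
text = """
{
  "course": "Data Collection",
  "demo": 1,
  "finished": false,
  "topics": ["APIs", "JSON"],
  "library": {"name": "requests", "language": "Python"}
}
"""

data = json.loads(text)

print(type(data))               # a JSON object becomes a Python dict
print(type(data["topics"]))     # a JSON array becomes a Python list
print(data["topics"][1])        # arrays are indexed by position
print(data["library"]["name"])  # nested objects are reached key by key
print(data["finished"])         # JSON false becomes Python False
```

Reading a nested value is therefore a chain of keys and positions, which is the same pattern you will use on API responses.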

Now go back to the page you have, and try to identify data pairs, objects and arrays.

### Questions:
- What is the type of categories?
- How many authors are there?
- How many objects are there in items? (Use https://jsonlint.com/ for a better visualisation of the JSON file.)

### Let's try it in Python!

In [5]:
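APIRequest = 'https://www.googleapis.com/books/v1/volumes?q=isbn:1860462979'
try:
    response = requests.get(APIRequest)
    results = response.json()  # parse the JSON body into Python dicts and lists
    # The first entry in 'items' holds the matching volume
    title = results['items'][0]['volumeInfo']['title']
    author = results['items'][0]['volumeInfo']['authors'][0]

    print(title)
    print(author)
except (requests.RequestException, ValueError, KeyError, IndexError):
    # Network failure, non-JSON body, or an expected field is missing
    print("There was something wrong!")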

Blindness
José Saramago


In [6]:
APIRequest = 'https://www.googleapis.com/books/v1/volumes?q=isbn:1292024208'
try:
    response = requests.get(APIRequest)
    results = response.json()
    title = results['items'][0]['volumeInfo']['title']
    # This book has two authors, so read both entries of the 'authors' array
    author1 = results['items'][0]['volumeInfo']['authors'][0]
    author2 = results['items'][0]['volumeInfo']['authors'][1]
    print(title)
    print(author1)
    print(author2)
except (requests.RequestException, ValueError, KeyError, IndexError):
    print("There was something wrong!")

Artificial Intelligence
Stuart Jonathan Russell
Peter Norvig


- The previous example has two authors. Change the code above so that it iterates through the authors array and collects all the authors automatically. 

In [7]:
APIRequest = 'https://www.googleapis.com/books/v1/volumes?q=isbn:1292024208'
try:
    response = requests.get(APIRequest)
    results = response.json()
    title = results['items'][0]['volumeInfo']['title']
    # Collect every author, however many the book has
    authors = []
    for author in results['items'][0]['volumeInfo']['authors']:
        authors.append(author)
    print(title)
    print(authors)

except (requests.RequestException, ValueError, KeyError, IndexError):
    print("There was something wrong!")

Artificial Intelligence
['Stuart Jonathan Russell', 'Peter Norvig']
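
The nested lookups above raise an error as soon as one key is missing, which is why the whole block sits inside `try`. As an alternative sketch, the hypothetical helper `extract_book` below (not part of Requests or the Google Books API) works on an already parsed response and falls back to defaults instead. The sample dictionary is a hand-trimmed stand-in that only mimics the fields used in this demo.

```python
def extract_book(results):
    """Return (title, authors) from a parsed response, or None if there is no item."""
    items = results.get("items", [])
    if not items:
        return None
    info = items[0].get("volumeInfo", {})
    title = info.get("title", "Unknown title")
    authors = info.get("authors", [])  # fall back to an empty list if the key is absent
    return title, authors


# Hand-trimmed stand-in for response.json(); a real response has many more fields
sample = {
    "items": [
        {"volumeInfo": {"title": "Artificial Intelligence",
                        "authors": ["Stuart Jonathan Russell", "Peter Norvig"]}}
    ]
}

print(extract_book(sample))
print(extract_book({}))  # a response with no 'items' key
```

Whether to use `try`/`except` or `.get` with defaults is a design choice: the first treats a missing field as a failure, the second as normal but incomplete data.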


- You are given a CSV file that contains a list of ISBNs, and you are asked to get, for each book:
    
    1- Title
    
    2- Author(s) - use a for loop to print all the names 
    
    3- Published date
    
    
- Note: Replace the print in the except block with 'continue' so that it does not repeatedly print "There was something wrong!".

- Use this expression to print all the authors in a Pythonic way: 
", ".join([str(x) for x in array])

In [8]:
import pandas as pd

# Read the ISBNs as text so that leading zeros and a trailing 'X' survive
isbns = pd.read_csv("isbn.csv", names=["isbn"], header=None, dtype=str)

for isbn in isbns["isbn"]:
    APIRequest = 'https://www.googleapis.com/books/v1/volumes?q=isbn:' + str(isbn)
    try:
        response = requests.get(APIRequest)
        results = response.json()
        volume = results['items'][0]['volumeInfo']
        title = volume['title']

        authors = []
        for author in volume['authors']:
            authors.append(author)

        publishedDate = volume['publishedDate']
        print(title)
        print(", ".join([str(x) for x in authors]))
        print(publishedDate)
        print(authors)
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Skip ISBNs that cannot be looked up or that lack one of the fields
        continue

Artificial Intelligence
Stuart Jonathan Russell, Peter Norvig
2013-07-31
['Stuart Jonathan Russell', 'Peter Norvig']
Blindness
José Saramago
1997-10
['José Saramago']
Great Expectations
Charles Dickens
2018-06-28
['Charles Dickens']
Pride and Prejudice
Jane Austen
1992
['Jane Austen']
سمراويت
جابر، حجي
2012
['جابر، حجي']
Post Office
Charles Bukowski
2011-10-31
['Charles Bukowski']
العصفورية
غازي القصيبي, دار الساقي
2017-03-21
['غازي القصيبي', 'دار الساقي']
القضايا الكبرى
مالك نبي
2014-01-01
['مالك نبي']
ثلاثية غرناطة
رضوي عاشور, دار الشروق
2001
['رضوي عاشور', 'دار الشروق']
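
Printing is fine for a demo, but for later analysis it is usually more convenient to keep the results in a table. The sketch below is one possible way to do that, not the method used in this notebook: a hypothetical function `collect_books` receives the ISBNs and a `fetch` callable that returns the parsed JSON for one ISBN. Passing the lookup in as a parameter is an assumption made here so that the example runs without network access; `fake_fetch` and its canned data (modeled on the output above) stand in for `requests.get(...).json()`, and `0000000000` is a made-up ISBN used to show a failed lookup.

```python
def collect_books(isbns, fetch):
    """Look up each ISBN with fetch(isbn) and return (records, failed_isbns)."""
    records, failed = [], []
    for isbn in isbns:
        try:
            info = fetch(isbn)["items"][0]["volumeInfo"]
            records.append({
                "isbn": isbn,
                "title": info["title"],
                "authors": ", ".join(info["authors"]),
                "publishedDate": info["publishedDate"],
            })
        except (KeyError, IndexError):
            failed.append(isbn)  # remember the ISBN instead of silently skipping it
    return records, failed


# Canned stand-in for the API, modeled on the output shown above
CANNED = {
    "1292024208": {"items": [{"volumeInfo": {
        "title": "Artificial Intelligence",
        "authors": ["Stuart Jonathan Russell", "Peter Norvig"],
        "publishedDate": "2013-07-31"}}]},
}


def fake_fetch(isbn):
    """Return the canned response, or an empty dict for unknown ISBNs."""
    return CANNED.get(isbn, {})


records, failed = collect_books(["1292024208", "0000000000"], fake_fetch)
print(records)
print(failed)
```

In a real run, `fetch` could be a small function that wraps `requests.get` and `response.json()`, and `pd.DataFrame(records)` would turn the list of dictionaries into a table.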
