# The Guardian API

In the `05_web_scraping_beautiful_soup.ipynb` notebook, we saw how BeautifulSoup can be used 
to parse messy HTML, to extract information, and to act as a rudimentary web crawler. 
We used The Guardian as an illustrative example of how this can be achieved. 
The Guardian was chosen because it provides a REST API to its servers. 
With the REST API it is possible to run specific queries against those servers and to receive 
current information in the format described in the API guide (i.e. JSON):

http://open-platform.theguardian.com/

In order to use their API, you will need to register for an API key. 
At the time of writing (Jan 28, 2020) this was an automated process that could be completed at 

https://bonobo.capi.gutools.co.uk/register/developer

On registration you will receive an API key which will look like: 303qwe2k-xxxx-xxxx-xxxx-eff86a248059

The API is documented here: 

http://open-platform.theguardian.com/documentation/

Python bindings to the API are available here:

https://github.com/prabhath6/theguardian-api-python

These can easily be integrated into a web crawler that is based on API calls rather than on 
HTML parsing. 

We use five parameters in our queries here: 

1. `section`: the section of the newspaper that we are interested in querying. In this case we will look at 
the technology section. 

2. `order-by`: the sort order of the results. We specify `newest`, so the most recent items appear first in the result list. 

3. `api-key`: The key `test` works for trying out the API, but for *real* deployment of such a spider an API key obtained from The Guardian should be specified. For the lab tasks, replace the `test` API key with your personal API key. 

4. `page-size`: the number of results to return per page. 

5. `q`: the search query. The later cells use `privacy AND data`, written as `privacy%20AND%20data` because the URL is assembled by hand. 
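As a minimal sketch of how these parameters become a request URL, the helper below assembles the query string with the standard library instead of by hand. The function name `build_search_url` and its defaults are assumptions made for illustration; the optional `query` argument corresponds to the `q` parameter used in the later cells.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://content.guardianapis.com/search'


def build_search_url(section, api_key='test', order_by='newest',
                     page_size=10, query=None):
    """Build a search URL from the parameters described above.

    Hypothetical helper: the name and default values are illustrative.
    """
    params = {
        'section': section,
        'order-by': order_by,
        'api-key': api_key,
        'page-size': page_size,
    }
    if query is not None:
        params['q'] = query
    # quote_via=quote encodes spaces as %20 rather than +
    return '{}?{}'.format(BASE_URL, urlencode(params, quote_via=quote))


print(build_search_url('technology', query='privacy AND data'))
```

Letting `urlencode` do the escaping means the query can be written as plain text (`privacy AND data`) instead of being percent-encoded by hand.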

In [1]:
from __future__ import print_function

import requests 
import json 

# Inspect all sections and search for technology-based sections

In [2]:
url = 'https://content.guardianapis.com/search?api-key=aeff7dbf-1329-4a39-98e2-e42c822ab954'
req = requests.get(url)
src = req.text 

In [3]:
json.loads(src)['response']['status']

'ok'

In [4]:
sections = json.loads(src)['response']

print(sections.keys())

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])


In [5]:
sections['results']

[{'id': 'world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine',
  'type': 'liveblog',
  'sectionId': 'world',
  'sectionName': 'World news',
  'webPublicationDate': '2021-02-05T15:17:50Z',
  'webTitle': "Coronavirus live news: Israel to ease lockdown; Von der Leyen compares UK vaccine 'speedboat' to EU 'tanker'",
  'webUrl': 'https://www.theguardian.com/world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine',
  'apiUrl': 'https://content.guardianapis.com/world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine',
  'isHosted': False,
  'pillarId': 'pillar/news',
  'pillarName': 'News'},
 {'id': 'us-news/live/2021/feb/05/joe-biden-donald-trump-impeachment-covid-coronavirus-marjorie-taylor-greene-live-updates',
  'type': 'liveblog',
  'sectionId': 'us-news',
  'sectionName': 'US news',
  'webPublicationDate': '2021-02-05T15:16:57

In [9]:
json.dumps(sections['results'][0], indent=2)

'{\n  "id": "world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine",\n  "type": "liveblog",\n  "sectionId": "world",\n  "sectionName": "World news",\n  "webPublicationDate": "2021-02-05T15:17:50Z",\n  "webTitle": "Coronavirus live news: Israel to ease lockdown; Von der Leyen compares UK vaccine \'speedboat\' to EU \'tanker\'",\n  "webUrl": "https://www.theguardian.com/world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine",\n  "apiUrl": "https://content.guardianapis.com/world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine",\n  "isHosted": false,\n  "pillarId": "pillar/news",\n  "pillarName": "News"\n}'

In [10]:
for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print(result['webTitle'], result['apiUrl'])
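
Searching the `id` for `tech` can miss relevant items, because the path in the `id` does not always match the section. A hedged alternative is to compare the `sectionId` field directly. The helper name `in_section` is an assumption, and the two sample records are abbreviated copies of results shown in this notebook's output.

```python
def in_section(result, section_id):
    """Return True if the result's sectionId field equals section_id."""
    return result.get('sectionId') == section_id


# Abbreviated results copied from this notebook's output. The second one
# has an id starting with 'world/' although its sectionId is 'technology'.
sample_results = [
    {'id': 'world/live/2021/feb/05/coronavirus-live-news-us-records-40000-deaths-in-two-weeks-mexico-runs-out-of-vaccine',
     'sectionId': 'world'},
    {'id': 'world/2021/jan/28/apple-and-facebook-at-odds-over-privacy-move-that-will-hit-online-ads',
     'sectionId': 'technology'},
]

tech_results = [r for r in sample_results if in_section(r, 'technology')]
for r in tech_results:
    print(r['id'])
```

With these two records, a substring test on the `id` finds nothing, while the `sectionId` comparison finds the technology article.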

# Manual query on whole API

In [167]:
# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': 'aeff7dbf-1329-4a39-98e2-e42c822ab954', 
    'page-size': '100',
    'q' : 'privacy%20AND%20data'
}

# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text

In [168]:
print('Number of characters received:', len(src))

Number of characters received: 60265


In [169]:
json.loads(src)

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 3319,
  'startIndex': 1,
  'pageSize': 100,
  'currentPage': 1,
  'pages': 34,
  'orderBy': 'newest',
  'results': [{'id': 'technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask',
    'type': 'article',
    'sectionId': 'technology',
    'sectionName': 'Technology',
    'webPublicationDate': '2021-02-02T16:44:37Z',
    'webTitle': 'iPhone update lets Apple Watch users unlock Face ID in a mask',
    'webUrl': 'https://www.theguardian.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask',
    'apiUrl': 'https://content.guardianapis.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask',
    'isHosted': False,
    'pillarId': 'pillar/news',
    'pillarName': 'News'},
   {'id': 'world/2021/jan/28/apple-and-facebook-at-odds-over-privacy-move-that-will-hit-online-ads',
    'type': 'article',
    'sectionId': 'techno

The API returns JSON, so we parse it with Python's built-in `json` module. 
The API specifies that all data are returned within the `response` key, even under failure. 
Therefore, we descend immediately to the `response` field. 

In [170]:
response = json.loads(src)['response']
response.keys()

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])

# task6

In [171]:
tech_section = response['results']

# status 
print("status:", response['status'])

# userTier 
print("userTier:", response['userTier'])

# total 
print("total:", response['total'])

# startIndex 
print("startIndex:", response['startIndex'])

# List the page size 
print("pagesize: ", response['pageSize'])

# currentPage
print("currentPage:", response['currentPage'])

# number of pages 
print("pages:", response['pages'])

# orderBy  
print("orderBy:", response['orderBy'])

# print(response['results'])

status: ok
userTier: developer
total: 3319
startIndex: 1
pagesize:  100
currentPage: 1
pages: 34
orderBy: newest
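
The metadata above are related: the number of pages is the total number of results divided by the page size, rounded up. The sketch below checks this arithmetic and shows one way to prepare the parameters for each page. It assumes the API's `page` parameter selects the page to return; the generator name `page_params` is hypothetical.

```python
import math

total = 3319      # 'total' from the response above
page_size = 100   # 'pageSize' from the response above

pages = math.ceil(total / page_size)
print(pages)  # 34, matching the 'pages' field


def page_params(base_args, pages):
    """Yield one parameter dict per page, adding a `page` entry.

    Hypothetical helper; assumes `page` is 1-indexed like 'currentPage'.
    """
    for page in range(1, pages + 1):
        params = dict(base_args)
        params['page'] = page
        yield params
```

Each yielded dictionary can be turned into a URL in the same way as the `args` dictionary earlier in this notebook.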


# task7

In [172]:
print(response['results'])

[{'id': 'technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask', 'type': 'article', 'sectionId': 'technology', 'sectionName': 'Technology', 'webPublicationDate': '2021-02-02T16:44:37Z', 'webTitle': 'iPhone update lets Apple Watch users unlock Face ID in a mask', 'webUrl': 'https://www.theguardian.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask', 'apiUrl': 'https://content.guardianapis.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask', 'isHosted': False, 'pillarId': 'pillar/news', 'pillarName': 'News'}, {'id': 'world/2021/jan/28/apple-and-facebook-at-odds-over-privacy-move-that-will-hit-online-ads', 'type': 'article', 'sectionId': 'technology', 'sectionName': 'Technology', 'webPublicationDate': '2021-01-28T19:14:49Z', 'webTitle': 'Apple and Facebook at odds over privacy move that will hit online ads', 'webUrl': 'https://www.theguardian.com/world/2021/jan/28/apple-and-

In [173]:
response['results'][0].keys()

dict_keys(['id', 'type', 'sectionId', 'sectionName', 'webPublicationDate', 'webTitle', 'webUrl', 'apiUrl', 'isHosted', 'pillarId', 'pillarName'])

In [174]:
response['results'][0]['webPublicationDate'][:4] == str(2021)

True
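
Slicing the first four characters works for this timestamp format. A slightly more robust sketch, assuming every `webPublicationDate` follows the format seen in the output above, parses the string into a `datetime` first. The function name `publication_year` is an assumption.

```python
from datetime import datetime


def publication_year(result):
    """Parse webPublicationDate and return the year as an int."""
    published = datetime.strptime(result['webPublicationDate'],
                                  '%Y-%m-%dT%H:%M:%SZ')
    return published.year


# Timestamp copied from the first result shown above
sample = {'webPublicationDate': '2021-02-02T16:44:37Z'}
print(publication_year(sample) == 2021)
```

Parsing also gives access to the month, day and time, which is useful when filtering by a date range rather than a single year.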

In [175]:
# Map each printed label to the substring searched for in the result id.
# Ids are hyphen-separated slugs, so multi-word terms need hyphens, not spaces.
keywords = [
    ('PRIVACY', 'privacy'),
    ('WHATSAPP', 'whatsapp'),
    ('SIGNAL', 'signal'),
    ('AI', '-ai-'),
    ('Artificial intelligence', 'artificial-intelligence'),
]

for result in response['results']: 
    result_id = result['id'].lower()
    for label, keyword in keywords: 
        if keyword in result_id: 
            print(label)
            print("id: ", result['id'])
            print("webTitle: ", result['webTitle'])
            print("URL: ", result['apiUrl'])
            print()

PRIVACY
id:  world/2021/jan/28/apple-and-facebook-at-odds-over-privacy-move-that-will-hit-online-ads
webTitle:  Apple and Facebook at odds over privacy move that will hit online ads
URL:  https://content.guardianapis.com/world/2021/jan/28/apple-and-facebook-at-odds-over-privacy-move-that-will-hit-online-ads

WHATSAPP
id:  technology/2021/jan/26/uk-regulator-to-write-to-whatsapp-over-facebook-data-sharing
webTitle:  UK regulator to write to WhatsApp over Facebook data sharing
URL:  https://content.guardianapis.com/technology/2021/jan/26/uk-regulator-to-write-to-whatsapp-over-facebook-data-sharing

PRIVACY
id:  technology/2021/jan/25/google-announces-plan-to-tackle-privacy-issues-in-online-advertising
webTitle:  Google announces plan to tackle privacy issues in online advertising
URL:  https://content.guardianapis.com/technology/2021/jan/25/google-announces-plan-to-tackle-privacy-issues-in-online-advertising

WHATSAPP
id:  technology/2021/jan/24/whatsapp-loses-millions-of-users-after-ter
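
Substring matching on the `id` needs workarounds such as `-ai-` for short terms. One possible alternative, sketched here under the assumption that matching on `webTitle` is acceptable for the task, uses a regular expression with word boundaries. The helper name `title_mentions` is hypothetical, and the second and third titles are made up for illustration.

```python
import re


def title_mentions(title, term):
    """Return True if term occurs in title as a whole word (case-insensitive)."""
    pattern = r'\b{}\b'.format(re.escape(term))
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


# First title is from the output above; the other two are made up.
print(title_mentions('UK regulator to write to WhatsApp over Facebook data sharing', 'whatsapp'))
print(title_mentions('How AI changes search', 'ai'))
print(title_mentions('Email chain', 'ai'))
```

The word boundaries stop `ai` from matching inside longer words such as "Email" or "chain".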

# task7-a

In [176]:
# Specify the arguments
args = {
    'section': 'business', 
    'order-by': 'newest', 
    'api-key': 'aeff7dbf-1329-4a39-98e2-e42c822ab954', 
    'page-size': '100',
    'q' : 'privacy%20AND%20data'
}

# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text

In [177]:
response = json.loads(src)['response']
response.keys()

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])

In [181]:
for result in response['results']: 
    if 'stock' in result['id'].lower(): 
        print("STOCK")
        print("id: ", result['id'])
        print("webTitle: ", result['webTitle'])
        print("URL: ", result['apiUrl'])
        print()
    
    if 'squeeze' in result['id'].lower(): 
        print("SQUEEZE")
        print("id: ", result['id'])
        print("webTitle: ", result['webTitle'])
        print("URL: ", result['apiUrl'])
        print()
        

STOCK
id:  business/live/2020/jun/24/asia-pacific-stock-markets-business-confidence-imf-growth-recession-business-live
webTitle:  European and US stock markets fall amid Covid-19 and trade fears - as it happened
URL:  https://content.guardianapis.com/business/live/2020/jun/24/asia-pacific-stock-markets-business-confidence-imf-growth-recession-business-live

STOCK
id:  business/2019/sep/14/technology-stock-market-ipos-2019-uber-lyft-slack-pinterest
webTitle:  Floating or falling? Tech companies that made stock market debuts in 2019
URL:  https://content.guardianapis.com/business/2019/sep/14/technology-stock-market-ipos-2019-uber-lyft-slack-pinterest

STOCK
id:  business/2019/jun/03/us-tech-stocks-alphabet-google-antitrust-investigation
webTitle:  US tech stocks slide as Google, Facebook and Apple fear antitrust investigations
URL:  https://content.guardianapis.com/business/2019/jun/03/us-tech-stocks-alphabet-google-antitrust-investigation



# task7-b

In [None]:
# I'm not sure about this one.

# task8

# Parsing the JSON

In [100]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))

The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']


# Verifying the status code

It is important to verify that the status message is `ok` before continuing: if it is not `ok`, no real data 
will have been received. 

In [101]:
assert response['status'] == 'ok'
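
An `assert` is convenient in a notebook, but assertions are skipped when Python runs with the `-O` flag. A small sketch of an explicit check is given below. The function name `parse_response` is an assumption, and both sample payloads are made up for illustration; they are not real API replies.

```python
import json


def parse_response(src):
    """Return the `response` object from a raw JSON string.

    Raises ValueError if the `response` key is missing or its status
    is not 'ok'. Hypothetical helper for illustration.
    """
    payload = json.loads(src)
    response = payload.get('response')
    if response is None or response.get('status') != 'ok':
        raise ValueError('Unexpected API reply: {}'.format(payload))
    return response


# Made-up payloads, shaped like the response keys seen in this notebook
ok_src = '{"response": {"status": "ok", "results": []}}'
error_src = '{"response": {"status": "error", "message": "made-up failure"}}'

print(parse_response(ok_src)['status'])
try:
    parse_response(error_src)
except ValueError as exc:
    print('Rejected:', exc)
```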

# Listing the results 

The API standard states that the results will be found in the `results` field under the `response` field. 
Furthermore, the URLs will be found in the `webUrl` field, and the title will be found in the `webTitle` 
field. 

First, let's look at what a single result looks like in full, and then print a restricted 
set of fields for the full set of results.

In [102]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))

{
  "apiUrl": "https://content.guardianapis.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask",
  "id": "technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask",
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News",
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2021-02-02T16:44:37Z",
  "webTitle": "iPhone update lets Apple Watch users unlock Face ID in a mask",
  "webUrl": "https://www.theguardian.com/technology/2021/feb/02/apple-iphone-update-solves-problem-of-unlocking-faceid-in-a-mask"
}


In [103]:
for result in response['results']: 
    print(result['webUrl'][:70], result['webTitle'][:20])

https://www.theguardian.com/technology/2021/feb/02/apple-iphone-update iPhone update lets A
https://www.theguardian.com/world/2021/jan/28/apple-and-facebook-at-od Apple and Facebook a
https://www.theguardian.com/technology/2021/jan/27/facebook-earnings-s Facebook CEO Mark Zu
https://www.theguardian.com/technology/2021/jan/26/uk-regulator-to-wri UK regulator to writ
https://www.theguardian.com/technology/2021/jan/26/grindr-fined-norway Grindr fined £8.6m i
https://www.theguardian.com/technology/2021/jan/25/google-announces-pl Google announces pla
https://www.theguardian.com/technology/2021/jan/24/whatsapp-loses-mill WhatsApp loses milli
https://www.theguardian.com/technology/2021/jan/24/is-it-time-to-leave Is it time to leave 
https://www.theguardian.com/technology/2021/jan/21/samsung-galaxy-s21- Samsung Galaxy S21 U
https://www.theguardian.com/technology/2021/jan/21/facebook-admits-enc Facebook admits encr
https://www.theguardian.com/technology/2021/jan/20/facebook-under-pres Facebook 