## Guide to Basic API Use: Getting News Articles from The Guardian
based off code from Sir Briane Paul V. Samson (https://brianesamson.com/)<br>
done as an activity for Data Science in DLSU (DATASCI)

#### Using The Guardian as a source
The Open Platform is The Guardian's public web service for accessing their content. To use it, we first need an authentication key.<br><br>

You can register one here:<br>
https://open-platform.theguardian.com/access/<br><br>

Their developer key is for non-commercial use, and it's free, so it's what we'll use.<br>
When working with any API, check its rate limits first so you know how much leeway you have. We have quite a bit here: 12 calls a second and 5,000 calls a day.
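
To stay under a per-second limit like the one above, requests can be spaced out on the client side. The following is only a minimal sketch: `throttled` and `CALLS_PER_SECOND` are hypothetical names introduced here, not part of The Guardian's API or of any library.

```python
import time

CALLS_PER_SECOND = 12               # per-second limit quoted above for the developer key
MIN_INTERVAL = 1 / CALLS_PER_SECOND


def throttled(items, min_interval=MIN_INTERVAL):
    """Yield items one by one, keeping consecutive yields at least min_interval seconds apart."""
    last = None
    for item in items:
        if last is not None:
            # time still owed before the next item may be handed out
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item


# Usage: wrap whatever drives the requests, e.g. the page numbers
for page in throttled(range(1, 4)):
    print(f'would request page {page} here')
```

With only three requests this is not strictly needed, but the same wrapper works unchanged for loops over hundreds of pages.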

Once we've registered for a key, we can use that along with their extensive documentation to form a request based on our needs. <br>

Documentation:<br>
https://open-platform.theguardian.com/documentation/<br><br>
Interactive preview of API responses (as of 03/18/2021, it appears to be less extensive than the API itself, since it lacks some of the filters used later on):<br>
https://open-platform.theguardian.com/explore/<br>

<em><b>Note that the next cell reads the API key from a text file (api-key.txt); please put your own API key in it.</b></em>

In [None]:
with open('api-key.txt') as f:
    # strip the trailing newline so it doesn't end up inside the request URL
    api_key = f.readline().strip()
api_key

I've already built a request through their Content endpoint (<em>Endpoint URL: https://content.guardianapis.com/search</em>).<br>

<b>Parameters:</b><br>
api-key: our API key<br>
format: json<br><br>

<b>Filters:</b><br>
from-date (2021-03-11) and to-date (2021-03-12): only return articles published within these dates<br>
page (1, 2, and 3, one per loop iteration): which page of results to fetch<br>
page-size (200): number of articles returned per request<br><br>

<b>Additional Info:</b><br>
show-fields (byline): to be able to get the author<br>
show-blocks (body): contains the full article<br>
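
The parameters listed above can also be kept in a dictionary and encoded into the query string, which is easier to read than one long hand-written URL. This is just a sketch using the standard library: `build_search_url` is a hypothetical helper, and `'YOUR-KEY'` is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

ENDPOINT = 'https://content.guardianapis.com/search'


def build_search_url(api_key, page, from_date='2021-03-11',
                     to_date='2021-03-12', page_size=200):
    """Assemble the search URL from the parameters described above."""
    params = {
        'format': 'json',
        'from-date': from_date,
        'to-date': to_date,
        'show-fields': 'byline',   # adds the author
        'show-blocks': 'body',     # adds the full article text
        'page': page,
        'page-size': page_size,
        'api-key': api_key,
    }
    return f'{ENDPOINT}?{urlencode(params)}'


print(build_search_url('YOUR-KEY', page=1))
```

If you are using `requests`, passing the same dictionary as `requests.get(ENDPOINT, params=params)` should give the same result without building the string yourself.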

In [None]:
import pandas as pd
import requests
import json

# the date range is the same for every page, so set it once outside the loop
from_date = '2021-03-11'
to_date = '2021-03-12'

responses = []
for page in range(1, 4):  # pages 1, 2, and 3
    URL = f'https://content.guardianapis.com/search?format=json&from-date={from_date}&to-date={to_date}&show-fields=byline&show-blocks=body&page={page}&page-size=200&api-key='
    responses.append(requests.get(URL + api_key).text)  # keep the raw JSON text of each page
    
len(responses)

#### Since the response is already in JSON format, we can load it and extract the info we need:

date: ['webPublicationDate']<br>
title: ['webTitle']<br>
fullArticle: ['blocks']['body'][0]['bodyTextSummary']<br>
author: ['fields']['byline']
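
As a rough sketch, the four lookups can be wrapped in one small function. `extract_fields` is a hypothetical helper, the key names are the ones used in the code cells of this notebook, and `sample` is a made-up, trimmed-down stand-in for a real result (not actual API output).

```python
def extract_fields(article):
    """Pull date, title, full text, and author out of one search result."""
    body_blocks = article.get('blocks', {}).get('body', [])
    return {
        'date': article.get('webPublicationDate'),
        'title': article.get('webTitle'),
        # the full text lives in the first body block, when there is one
        'fullArticle': body_blocks[0].get('bodyTextSummary') if body_blocks else None,
        'author': article.get('fields', {}).get('byline'),
    }


# made-up example with the same shape as one entry of response -> results
sample = {
    'type': 'article',
    'webPublicationDate': '2021-03-11T10:00:00Z',
    'webTitle': 'Example headline',
    'fields': {'byline': 'Jane Doe'},
    'blocks': {'body': [{'bodyTextSummary': 'Example body text.'}]},
}

print(extract_fields(sample))
```

Using `.get` with defaults means a missing key turns into `None` instead of raising a `KeyError`, which helps with results that lack an author or a body.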

In [None]:
news_source_json = json.loads(responses[0])
news_source_json['response']

In [None]:
news_source_json = news_source_json['response']['results']
news_source_json

In [None]:
sports_article = news_source_json[0]
sports_article

In [None]:
data = {}
data['date'] = sports_article['webPublicationDate']
data['title'] = sports_article['webTitle']
data['fullArticle'] = sports_article['blocks']['body'][0]['bodyTextSummary']
data['author'] = sports_article['fields']['byline']
print(f"Author: {data['author']}\n")
print(f"Full Article:\n{data['fullArticle']}")

### Notable Articles with Weird or Missing Info:
No Author<br>
https://www.theguardian.com/fashion/2021/mar/12/from-celebrity-sex-toys-to-connells-chain-this-weeks-fashion-trends<br><br>

Crossword article, no "full article"<br>
https://www.theguardian.com/crosswords/cryptic/28390<br>
https://www.theguardian.com/crosswords/cryptic/28391<br><br>

Liveblogs (one of many)<br>
https://www.theguardian.com/football/live/2021/mar/12/newcastle-v-aston-villa-premier-league-live<br><br>

Let's extend the previous code to skip crosswords and other non-article content types (such as live blogs).

In [None]:
news_json = []

for response in responses:
    articles = json.loads(response)['response']['results']
    for article in articles:
        if article['type'] != 'article': # skip crosswords, live blogs, and other non-article content
            continue
        data = {}
        data['date'] = article['webPublicationDate']
        data['title'] = article['webTitle']
        # some results come back with an empty body list, so there is no text to extract
        if len(article['blocks']['body']) > 0:
            data['fullArticle'] = article['blocks']['body'][0]['bodyTextSummary']
        else:
            data['fullArticle'] = None
        # results without an author have no 'fields' entry at all
        if 'fields' in article:
            data['author'] = article['fields'].get('byline')
        else:
            data['author'] = None
        news_json.append(data)

len(news_json)

In [None]:
with open("news.json", "w") as outfile:
    json.dump(news_json, outfile)
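
To check that a saved file is usable, it can be read back with `json.load`. The sketch below uses a made-up one-item list and a temporary file (both assumptions for illustration) so that it runs on its own without touching the real output.

```python
import json
import os
import tempfile

# made-up record with the same keys as the ones collected above
records = [
    {'date': '2021-03-11T10:00:00Z', 'title': 'Example headline',
     'fullArticle': None, 'author': None},
]

path = os.path.join(tempfile.mkdtemp(), 'news.json')

with open(path, 'w', encoding='utf-8') as outfile:
    json.dump(records, outfile)

with open(path, encoding='utf-8') as infile:
    loaded = json.load(infile)

# None is written as null in the file and comes back as None
print(loaded == records)
```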