
## APIs - Application Programming Interfaces


### The Guardian API

1. Register as a developer and sign up for an API key (<a href="https://bonobo.capi.gutools.co.uk/register/developer" target="_blank">here</a>).

2. Visit the <a href="https://open-platform.theguardian.com/explore/" target="_blank">content explorer</a> to get an idea of what data is available: it lets you build queries and browse the results quickly, without Python.


### Overview of the process
1. Install the "requests" library
2. Generate API request
3. Extract the data
4. Save the data

### Step 1. Install the "requests" library

See the <a href="https://2.python-requests.org/en/master/" target="_blank">requests library</a> documentation for details.

In [None]:
# To install the package (if it is missing): conda install requests

# Import the library:
import requests

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")


### Step 2. Generate API request

#### 2.1 Specify the parameters

In [None]:
# Specify your own api key which you received via e-mail after registration:
api_key = 'd1f9dedc-a57a-4d2c-82dd-7f54a38584b8'

# Specify a particular endpoint (in this example we use `sections`):
api_endpoint = 'http://content.guardianapis.com/sections?'

# Specify a keyword (what you would put in a search field)
query = 'business'

In [None]:
# let's merge parameters to create URL using f-strings:

query_url = f"{api_endpoint}" \
            f"api-key={api_key}" \
            f"&q={query}"
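
The f-string approach works for a single-word keyword such as `business`, but a keyword with spaces or special characters (for example `big data`) has to be URL-encoded before it goes into the URL. A minimal sketch using `urlencode` from the standard library; `YOUR_API_KEY` is a placeholder, not a real key:

```python
from urllib.parse import urlencode

api_endpoint = "http://content.guardianapis.com/sections"

# Placeholder values for illustration only
params = {"api-key": "YOUR_API_KEY", "q": "big data"}

# urlencode escapes spaces and special characters in every value
query_url = f"{api_endpoint}?{urlencode(params)}"
print(query_url)
```

`requests.get` performs the same encoding itself when the parameters are passed as a dictionary.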

In [None]:
# Let's make a request to the API endpoint and look at the list of sections about business:

api_endpoint = 'http://content.guardianapis.com/sections'

my_params = {
    'api-key': api_key,
    'q': "business",
}
# Pass the parameters as a dictionary: requests builds and encodes the query string
r = requests.get(api_endpoint, params=my_params)
data = r.json()
data["response"]["results"]
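
A request can fail, for example when the API key is rejected or the network is down, and in that case `r.json()` may not contain a `response` with `results`. A minimal sketch of a safer request, assuming a hypothetical helper named `fetch_json` (it is not part of `requests`) and an arbitrary 10-second timeout:

```python
import requests


def fetch_json(endpoint, params, session=None, timeout=10):
    """Send a GET request and return the decoded JSON body.

    `fetch_json` is a hypothetical helper written for this tutorial.
    """
    # Use the given session if there is one, otherwise the requests module itself
    http = session or requests
    response = http.get(endpoint, params=params, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.json()
```

It can be used in place of the two lines above: `data = fetch_json(api_endpoint, my_params)`.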

#### 2.2 Filter the input parameters

We have used `q=` for the keyword parameter. More parameters can be set.

Information about search parameters is <a href="https://open-platform.theguardian.com/documentation/search" target="_blank">here</a>. 

In [None]:
api_endpoint = 'http://content.guardianapis.com/search'

my_params = {
    'q': "big data",
    'order-by': "newest",
    'show-fields': 'all',
    'section': "business",
    'page-size': 200,
    'api-key': api_key
}
# Pass the parameters as a dictionary: requests builds and encodes the query string
r = requests.get(api_endpoint, params=my_params)
r.json()
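
A single request returns at most `page-size` articles. The Guardian response also reports the number of available `pages`, and the search endpoint accepts a `page` parameter, so the remaining articles can be collected in a loop. A minimal sketch, assuming two hypothetical names: `collect_results`, and a function `get_page(n)` that returns the decoded JSON for page `n`. The `max_pages` cap is an arbitrary safeguard against using up the request quota:

```python
def collect_results(get_page, max_pages=5):
    """Gather `results` from successive pages.

    `get_page(n)` must return the decoded JSON for page `n` (starting at 1).
    Both names are hypothetical helpers for this sketch.
    """
    results = []
    page = 1
    while page <= max_pages:
        response = get_page(page)["response"]
        results.extend(response["results"])
        # Stop once the last available page has been read
        if page >= response["pages"]:
            break
        page += 1
    return results
```

With the request above, `get_page` could be `lambda n: requests.get(api_endpoint, params={**my_params, 'page': n}).json()`.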

### Step 3. Extract the data

#### 3.1 Output in 'json'

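Calling `.json()` on the response object decodes the JSON body into Python dictionaries and lists.

Structure of the Guardian output: the top-level `response` key holds general information about the request, including `results`, a list of articles and their metadata.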
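
To make the nesting concrete, here is a small sketch that walks through a trimmed, made-up response. The values are invented for illustration; only the nesting (`response` -> `results`) follows the description above:

```python
# A trimmed, made-up response with the same nesting as a real one
sample = {
    "response": {
        "status": "ok",
        "total": 2,
        "results": [
            {"id": "business/example-1", "webTitle": "First example"},
            {"id": "business/example-2", "webTitle": "Second example"},
        ],
    }
}

response = sample["response"]    # general information about the request
articles = response["results"]   # list with one dictionary per article

print(response["status"], response["total"])
for article in articles:
    print(article["id"], "->", article["webTitle"])
```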

In [None]:
# Extract the data
data = r.json()
data

#### 3.2 Convert to DataFrame

1. Get to `results`, where all the articles are, with `data['response']['results']`.
2. Use the `pandas` package for data manipulation, in particular `pd.json_normalize()`, which takes parsed JSON (a dictionary or a list of dictionaries) and returns the data as a DataFrame.
3. Data cleaning: select the variables of interest and rename them for convenience.
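
As an illustration of step 2, the sketch below applies `pd.json_normalize()` directly to a list of article records. The records are made up and only mimic the shape of `data['response']['results']`:

```python
import pandas as pd

# Made-up records shaped like the entries of data["response"]["results"]
sample_results = [
    {"id": "business/a", "webTitle": "A",
     "fields": {"headline": "A", "wordcount": "120"}},
    {"id": "business/b", "webTitle": "B",
     "fields": {"headline": "B", "wordcount": "340"}},
]

# Nested dictionaries become dotted column names such as "fields.headline"
flat = pd.json_normalize(sample_results)
print(flat.columns.tolist())
```

Normalizing the list of results, rather than the whole response, gives one row per article and one column per nested attribute.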


In [None]:
# Let's import the pandas library
# `json_normalize` can be imported from the top-level pandas namespace from version 1.0 onwards.
# If the import fails, execute the following command to install pandas 1.0.3:
#!pip install --user pandas==1.0.3
import pandas as pd
from pandas import json_normalize
print(pd.__version__)
data["response"]["results"]

In [None]:
# Normalizing the whole response gives a single row;
# all the articles stay nested in the `response.results` column
ndata = json_normalize(data)
print(ndata)
ndata["response.results"]

In [None]:
# To create DataFrame
df = pd.DataFrame(data["response"]["results"])
# df.dtypes
df

#### 3.3 Observe DataFrame parameters

In [None]:
df.shape

In [None]:
df.size

In [None]:
# Column names as a list
list(df.columns)

In [None]:
df["fields"]

In [None]:
# The `fields` column contains dictionaries.
# In order to use the data inside these dictionaries we are going to add new columns
# to the DataFrame based on the attributes stored in `fields`.

In [None]:
# Copy selected attributes of each `fields` dictionary into new columns.
# Keys are the new column names, values are the attribute names inside `fields`.
new_columns = {
    "headline": "headline",
    "shortUrl": "shortUrl",
    "standFirst": "standfirst",
    "wordcount": "wordcount",
    "bodyText": "bodyText",
}

for column, attribute in new_columns.items():
    # .get() leaves a missing value (None) if an article lacks the attribute
    df[column] = df["fields"].apply(lambda field: field.get(attribute))

# Word counts may arrive as text, so convert them to numbers
df["wordcount"] = pd.to_numeric(df["wordcount"], errors="coerce")

df

#### 3.4 Create a subset with the required variables

In [None]:
# Select variables of interest (.copy() gives an independent DataFrame,
# which avoids SettingWithCopyWarning when it is modified later)

df_subset = df[['id', 'type', 'sectionName', 'webPublicationDate', 'webTitle',
                'fields', 'headline', 'shortUrl', 'pillarName',
                'standFirst', 'wordcount', 'bodyText']].copy()

# Rename variables if necessary

df_subset = df_subset.rename(columns={"sectionName": "section", "webPublicationDate": "date",
                                      "standFirst": "snippet", "shortUrl": "url",
                                      "bodyText": "article_text"})

df_subset


In [None]:
# Format the dates

df_subset['date']=pd.to_datetime(df_subset['date']).dt.strftime('%Y-%m-%d')
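
Note that `strftime` turns the dates into plain text, which is convenient for display but no longer supports date operations. The sketch below shows both forms side by side on two made-up timestamps written in the ISO 8601 format of `webPublicationDate`:

```python
import pandas as pd

# Made-up timestamps in the ISO 8601 format used by `webPublicationDate`
dates = pd.Series(["2020-04-01T10:15:00Z", "2020-03-28T07:30:00Z"])

parsed = pd.to_datetime(dates)             # timezone-aware datetime values
as_text = parsed.dt.strftime("%Y-%m-%d")   # plain strings, good for display

print(as_text.tolist())
print(parsed.dt.year.tolist())             # datetime accessors still work on `parsed`
```

If the dates are needed later for sorting or filtering by period, it may be worth keeping a datetime column next to the formatted one.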


In [None]:
# The final DataFrame:
df_subset


#### 3.5 Observe the DataFrame subset

In [None]:
# Article text in one of the rows (index label 155 must exist, i.e. at least 156 results)
df_subset["article_text"][155]

In [None]:
# URL of the article
df_subset["url"]

In [None]:
# Number of rows and columns
df_subset.shape

In [None]:
# Names of columns
df_subset.columns

### Step 4. Save the data

In [None]:
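# Adjust the path to your own machine. Writing .xlsx files needs an Excel engine such as openpyxl.
# The `encoding` argument is omitted because recent pandas versions no longer accept it in to_excel.
df_subset.to_excel(r'C:\Users\example\theguardian\Database.xlsx')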
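
The Windows path above only works on one machine. As an alternative, here is a minimal sketch that saves the data as CSV to a folder built with `pathlib`, which works the same way on Windows, macOS and Linux. The helper `save_articles` and the default file name are hypothetical:

```python
from pathlib import Path

import pandas as pd


def save_articles(frame, out_dir, name="guardian_articles.csv"):
    """Write `frame` to a CSV file inside `out_dir` and return the path.

    `save_articles` and the default file name are hypothetical.
    """
    out_path = Path(out_dir) / name
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves the row numbers out of the file
    frame.to_csv(out_path, index=False, encoding="utf-8")
    return out_path
```

For example, `save_articles(df_subset, "theguardian")` would write the file into a `theguardian` folder next to the notebook.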