## Requesting Data from Application Programming Interfaces (APIs)
This notebook demonstrates the fundamentals of interacting with a web-hosted API to retrieve data. Much of this functionality is provided by the **requests** library, which should already be installed on your machine as part of the **Anaconda** Python distribution. Documentation for the **requests** library is available here:
https://docs.python-requests.org/en/latest/user/quickstart/

### 1.0. Prerequisites
If you find that the **requests** library isn't already installed on your machine, run the following command in a new **Terminal** window in your Jupyter environment, just as you have in previous labs.
-  python -m pip install requests

#### 1.1. Import the libraries that you'll be working with in the notebook

In [None]:
import os
import json
import pprint
import requests
import requests.exceptions
import pandas as pd

### 2.0. Issue a Request to an API Endpoint
The following function issues a **request** to a REST API endpoint via the HTTP request/response mechanism. It demonstrates returning the *JSON payload* of the **response** object as one of two **response_types**; either as a **string** or as a **Pandas DataFrame**.  

#### 2.1. Exception Handling:
Because a request can fail in several ways (an invalid URL, a network outage, a timeout, or an error reported by the server), the **get_api_response()** function implements **exception handling** for each case. When connecting to an HTTP endpoint, the server may return any of the following response **status_codes**:
- **200:** Everything went okay, and the result has been returned (if any).
- **301:** The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
- **400:** The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
- **401:** The server thinks you’re not authenticated. Many APIs require login credentials, so this happens when you submit the wrong credentials.
- **403:** The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.
- **404:** The resource you tried to access wasn’t found on the server.
- **503:** The server is not ready to handle the request.
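The status codes listed above can also be looked up programmatically. The following is a minimal sketch using Python's standard **http.HTTPStatus** enumeration; the **describe_status()** helper is a hypothetical name introduced only for illustration and is not part of the **requests** library.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif code >= 500:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {status.phrase} ({category})"

# Summarize each of the status codes discussed in this section
for code in (200, 301, 400, 401, 403, 404, 503):
    print(describe_status(code))
```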

In [None]:
def get_api_response(url, response_type):
    try:
        # Without a timeout, requests waits indefinitely; 10 seconds is an arbitrary choice
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    
    except requests.exceptions.HTTPError as errh:
        return "An Http Error occurred: " + repr(errh)
    except requests.exceptions.ConnectionError as errc:
        return "An Error Connecting to the API occurred: " + repr(errc)
    except requests.exceptions.Timeout as errt:
        return "A Timeout Error occurred: " + repr(errt)
    except requests.exceptions.RequestException as err:
        return "An Unknown Error occurred: " + repr(err)

    if response_type == 'json':
        result = json.dumps(response.json(), sort_keys=True, indent=4)
    elif response_type == 'dataframe':
        result = pd.json_normalize(response.json())
    else:
        # Reached only when response_type is neither 'json' nor 'dataframe'
        result = f"Unsupported response_type: {response_type!r}"
        
    return result
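To see how **raise_for_status()** turns an error status code into an exception without touching the network, the sketch below builds **Response** objects by hand. The **simulate_response()** and **check()** helpers and the example URL are assumptions made for illustration; in normal use the **Response** object comes from **requests.get()**.

```python
import requests

def simulate_response(status_code, url="https://api.example.com/resource"):
    """Build a Response by hand (hypothetical URL; no network traffic occurs)."""
    response = requests.Response()
    response.status_code = status_code
    response.url = url
    return response

def check(response):
    """Mirror the HTTPError branch of get_api_response()."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError as errh:
        return "An Http Error occurred: " + repr(errh)
    return "OK"

print(check(simulate_response(200)))
print(check(simulate_response(404)))
```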

#### 2.2. Unit test to ensure proper exception handling functionality

In [None]:
bad_url = "https://api.open-notify.org/this-api-doesnt-exist"
valid_url = "http://universities.hipolabs.com/search?name=middle"

response_type = ['json', 'dataframe']

In [None]:
json_string = get_api_response(bad_url, response_type[0])
print(json_string)

In [None]:
df = get_api_response(bad_url, response_type[1])
print(df)

#### 2.3. Unit test to ensure proper data retrieval functionality
Here we can see that when specifying **response_type[0]** we get back a **string in JSON format**, and when specifying **response_type[1]** we get back a **Pandas DataFrame**. On closer inspection we can observe that the JSON payload is a **list** of **dictionaries**, each of which includes nested **lists** for the **domains** and **web_pages** fields in addition to other fields in simple **"key" : "value"** format. This presents a problem we will have to handle in order to have a correctly formed **DataFrame** because, as we learned when designing **OLTP** databases, having multiple values in a single column violates the **First Normal Form**.
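The problem can be reproduced on a small scale without calling the API. The sketch below uses hypothetical sample records shaped like the payload described above (the institution names and domains are invented) and shows that a plain **json_normalize()** call leaves each nested list intact inside a single cell.

```python
import pandas as pd

# Hypothetical records mimicking the shape of the API payload
sample = [
    {"name": "Example University", "country": "Exampleland", "alpha_two_code": "EX",
     "domains": ["example.edu", "mail.example.edu"],
     "web_pages": ["https://www.example.edu/"]},
    {"name": "Sample College", "country": "Samplestan", "alpha_two_code": "SA",
     "domains": ["sample.edu"],
     "web_pages": ["https://www.sample.edu/"]},
]

sample_df = pd.json_normalize(sample)

# Each 'domains' cell still holds a Python list, i.e., multiple values in one column
print(sample_df[["name", "domains"]])
print(type(sample_df.loc[0, "domains"]))
```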

In [None]:
json_string = get_api_response(valid_url, response_type[0])
print(json_string)

In [None]:
df = get_api_response(valid_url, response_type[1])

print(df.shape)
print(df.columns)

df.info()

In [None]:
df

#### 2.4. Perform Desired Transformations
In any ETL process, there will be some form of data **transformation**.  Here we will explore transforming JSON data.

As identified above, the first issue we must handle is the nested **lists** that may contain multiple **domains** and **web_pages**. To do so we will explore the advanced capabilities of the Pandas **json_normalize()** function, but first we will create a simplified function that retrieves a JSON object from an API.

In [None]:
def get_api_data(url):
    try:
        # Without a timeout, requests waits indefinitely; 10 seconds is an arbitrary choice
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    
    except requests.exceptions.HTTPError as errh:
        return "An Http Error occurred: " + repr(errh)
    except requests.exceptions.ConnectionError as errc:
        return "An Error Connecting to the API occurred: " + repr(errc)
    except requests.exceptions.Timeout as errt:
        return "A Timeout Error occurred: " + repr(errt)
    except requests.exceptions.RequestException as err:
        return "An Unknown Error occurred: " + repr(err)
        
    return response.json()

In [None]:
json_data = get_api_data(valid_url)
print(json_data)

Next, we can **flatten** (i.e., normalize) the fields containing the nested lists (**domains** and **web_pages**) using the **record_path** parameter of the **pandas.json_normalize** function.

In [None]:
pd.json_normalize(json_data, record_path=['domains'])

We can confirm that the *domains* field has been flattened since we now have 10 observations where before we had only 9. However, we also want to include other fields, which we accomplish with the **meta** parameter. Note that we've also omitted the **state-province** field since it doesn't appear to contain any useful data. What's more, since it's possible for some **keys** to be missing in a JSON document, we can suppress any resulting errors using the **errors='ignore'** parameter.

In [None]:
df = pd.json_normalize(json_data,
                       record_path=['domains'],
                       meta=['country', 'name', 'alpha_two_code'],
                       errors='ignore')
df
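The effect of **record_path**, **meta**, and **errors='ignore'** is easier to see on a tiny input. The sketch below uses hypothetical sample records (not the API payload), one of which deliberately lacks the **name** key; with **errors='ignore'** the missing value is filled with **NaN** rather than raising a **KeyError**.

```python
import pandas as pd

# Hypothetical records; the second one has no "name" key
sample = [
    {"name": "Example University", "country": "Exampleland",
     "domains": ["example.edu", "mail.example.edu"]},
    {"country": "Samplestan", "domains": ["sample.edu"]},
]

flat = pd.json_normalize(sample,
                         record_path=["domains"],    # one row per domain
                         meta=["country", "name"],   # repeat these fields on each row
                         errors="ignore")            # missing meta keys become NaN

# The flattened values land in a column labeled 0 (an integer, not a string)
print(flat)
```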

Next, we can normalize the **web_pages** list to ensure a unique row for each of its unique values as we add it to the DataFrame.

In [None]:
df['web_pages'] = pd.json_normalize(json_data, record_path=['web_pages'])
df
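Because column assignment in pandas aligns on the row index, the assignment above pairs each web page with a domain by position, which may not line up when an institution lists more than one domain or web page. As one possible alternative (not the approach used in this notebook), the sketch below applies **DataFrame.explode()** to hypothetical sample records so that every domain and web page stays attached to its own institution.

```python
import pandas as pd

# Hypothetical records with differing numbers of domains and web pages
sample = [
    {"name": "Example University",
     "domains": ["example.edu", "mail.example.edu"],
     "web_pages": ["https://www.example.edu/"]},
    {"name": "Sample College",
     "domains": ["sample.edu"],
     "web_pages": ["https://www.sample.edu/", "https://sample.edu/"]},
]

# Explode each list column in turn; every domain/web page pair stays with its institution
flat = (pd.DataFrame(sample)
          .explode("domains")
          .explode("web_pages")
          .reset_index(drop=True))

print(flat)
```

Note that exploding two columns in sequence yields every combination of domain and web page for an institution, which may or may not be the desired grain for the destination table.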

Finally, we create a dictionary to **map** new column names to the old ones using the **rename()** function of the **pandas.DataFrame**.  We also demonstrate how columns can be reordered by simply passing a **list** of column names in the desired order.

In [None]:
column_name_map = {0 : "Domain",
                   "country" : "Country",
                   "name" : "Institution_Name",
                   "alpha_two_code" : "Country_Code",
                   "web_pages" : "Web_Address"
                  }

df.rename(columns=column_name_map, inplace=True)
df = df[['Institution_Name','Country','Country_Code','Domain','Web_Address']]
df

With the data having been **extracted** from an API, and any desired **transformations** having been accomplished, we can now **load** the data into any desired destination; e.g., SQL database, NoSQL database, or data lake (file system).
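As an illustration of the **load** step, the sketch below writes a small DataFrame to an in-memory SQLite database and reads it back. The table name **institutions** and the sample rows are assumptions chosen for the example; a real pipeline would target whichever database or file system the project uses.

```python
import sqlite3
import pandas as pd

# Hypothetical rows using the column names produced by the transformations above
sample_df = pd.DataFrame({
    "Institution_Name": ["Example University", "Sample College"],
    "Country": ["Exampleland", "Samplestan"],
    "Country_Code": ["EX", "SA"],
    "Domain": ["example.edu", "sample.edu"],
    "Web_Address": ["https://www.example.edu/", "https://www.sample.edu/"],
})

conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk

# Load: create (or replace) the table and insert every row
sample_df.to_sql("institutions", conn, index=False, if_exists="replace")

# Read the rows back to confirm the load succeeded
loaded = pd.read_sql_query("SELECT * FROM institutions", conn)
conn.close()

print(loaded)
```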

### 3.0. API Endpoint Authentication & Parameters

In [None]:
def get_api_response(url, headers, params):
    try:
        # Without a timeout, requests waits indefinitely; 10 seconds is an arbitrary choice
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()
    
    except requests.exceptions.HTTPError as errh:
        return "An Http Error occurred: " + repr(errh)
    except requests.exceptions.ConnectionError as errc:
        return "An Error Connecting to the API occurred: " + repr(errc)
    except requests.exceptions.Timeout as errt:
        return "A Timeout Error occurred: " + repr(errt)
    except requests.exceptions.InvalidHeader as erri:
        return "A Header Error occurred: " + repr(erri)
    except requests.exceptions.RequestException as err:
        return "An Unknown Error occurred: " + repr(err)
        
    return response.json()

Before you can start using the GitHub API, you'll need to generate a *personal access token* to enable you to authenticate to the API. To create your own token, navigate to https://github.com/settings/tokens and then click the **Generate New Token** button. 

In [None]:
# Never commit a real token to a notebook; substitute your own token for this placeholder
GITHUB_TOKEN = "<your-personal-access-token>"
os.environ["GITHUB_TOKEN"] = GITHUB_TOKEN

In [None]:
token = os.getenv('GITHUB_TOKEN', '...')
print(token)
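One way to keep a token out of the notebook entirely is to read it from an environment variable set outside of Jupyter and fail early when it is missing. The sketch below assumes a hypothetical **build_github_headers()** helper and uses a placeholder variable named **DEMO_TOKEN** so that it runs without a real credential. GitHub accepts the **Bearer** scheme shown here as well as the **token** scheme.

```python
import os

def build_github_headers(env_var="GITHUB_TOKEN"):
    """Build request headers from a token stored in an environment variable."""
    token = os.getenv(env_var)
    if not token:
        # Fail early instead of sending an unauthenticated request by accident
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }

# Placeholder value for demonstration only; a real token would be set outside the notebook
os.environ["DEMO_TOKEN"] = "placeholder-not-a-real-token"
demo_headers = build_github_headers("DEMO_TOKEN")
print(sorted(demo_headers))  # print the header names only, never the token itself
```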

In [None]:
owner = "JTupitza-UVA"
repo = "DS-2002"
query_url = f"https://api.github.com/repos/{owner}/{repo}/issues"

params = {
    "state": "open",
}

headers = {'Authorization': f'token {token}'}

In [None]:
json_data = get_api_response(query_url, headers, params)
pprint.pprint(json_data)
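To see how the **params** dictionary ends up in the request, the sketch below prepares a request without sending it and prints the resulting URL. The **per_page** parameter is added here only to show how multiple parameters are joined; it is not used in the cells above.

```python
import requests

owner = "JTupitza-UVA"
repo = "DS-2002"
query_url = f"https://api.github.com/repos/{owner}/{repo}/issues"

# per_page is included only for illustration
params = {"state": "open", "per_page": 5}

# Preparing a request builds the final URL but sends nothing over the network
prepared = requests.Request("GET", query_url, params=params).prepare()
print(prepared.url)
```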