<a href="https://colab.research.google.com/github/Liping-LZ/BDAI_2324/blob/main/Internet%20data%20collection/Extracting_API_data_(example_two).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting API data
In our previous tutorial we looked at extracting API data, working through fairly simple data mining activities in XML. Here we will expand on this by building a program to interact with a series of APIs. Specifically, we will follow this workflow:
1. Creating a random/fake person using the _RandomUser_ API;
2. Predicting the age of the user using the _Agify.io_ API;
3. Predicting the nationality of the user using the _Nationalize.io_ API.

As you can see, it's a fairly silly task. However, these APIs require no security keys to access (which makes life far easier for us), and it's enough to give us some basic exposure to working with APIs.

Let's begin!

In [None]:
import json
import requests

random_user = "https://randomuser.me/api/"

Everything should be self-explanatory in the above: we import the relevant libraries/packages (_json_ to deal with the API's JSON outputs and _requests_ to make the API calls) and we create a variable to store the API endpoint for our fake person generator (the _RandomUser_ API). The next step will be to call the API:

In [None]:
def name_generator(url):
  r = requests.get(url)
  return r

new_user = name_generator(random_user)
new_user.content

b'{"results":[{"gender":"male","name":{"title":"Mr","first":"Mike","last":"Ferguson"},"location":{"street":{"number":6091,"name":"Manor Road"},"city":"Wolverhampton","state":"Lancashire","country":"United Kingdom","postcode":"U1 5TP","coordinates":{"latitude":"-9.7850","longitude":"-21.3812"},"timezone":{"offset":"+3:30","description":"Tehran"}},"email":"mike.ferguson@example.com","login":{"uuid":"b3996f9e-a5e7-4b66-9fd0-09180ab2e5ec","username":"ticklishbird156","password":"womans","salt":"Pv3DRAze","md5":"e7504fcfb8d25fafa34544718b22a6c0","sha1":"a56ce13e6fb646de96ec04ff6896e0d3c4a4bccc","sha256":"ff99c08bd7320ec76c35e3c23d2dda2033e79080782e30f22722bffa0a1cb090"},"dob":{"date":"1976-11-12T12:01:22.143Z","age":46},"registered":{"date":"2004-02-24T10:35:32.762Z","age":18},"phone":"013873 41079","cell":"07453 390313","id":{"name":"NINO","value":"YA 92 20 49 S"},"picture":{"large":"https://randomuser.me/api/portraits/men/50.jpg","medium":"https://randomuser.me/api/portraits/med/men/50.jp

Here we create a simple function to send the API request (using the _requests_ package as above). You may note the function we use is _get_ which (unsurprisingly) translates as a _GET_ HTTP request in the RESTful framework (see previous Notebook if that is not clear).

We then use our function to create a new variable (_new\_user_). Note that what we get back is not the JSON itself but a _Response_ object from the _requests_ package, which wraps everything the server sent back. In this case the body is JSON, but other APIs may return XML or similar. From this object we use the attribute _content_ to access the raw body. We could also use attributes such as _encoding_ or _status\_code_ if that's the information we require (see more [here](https://docs.python-requests.org/en/latest/index.html)).

The _content_ attribute, however, is a Python bytes object (note the _b_ prefix in the output above). It would be more useful if we could interact with this as a dictionary parsed from the JSON:

In [None]:
response_dict = new_user.json()
response_dict

{'results': [{'gender': 'male',
   'name': {'title': 'Mr', 'first': 'Mike', 'last': 'Ferguson'},
   'location': {'street': {'number': 6091, 'name': 'Manor Road'},
    'city': 'Wolverhampton',
    'state': 'Lancashire',
    'country': 'United Kingdom',
    'postcode': 'U1 5TP',
    'coordinates': {'latitude': '-9.7850', 'longitude': '-21.3812'},
    'timezone': {'offset': '+3:30', 'description': 'Tehran'}},
   'email': 'mike.ferguson@example.com',
   'login': {'uuid': 'b3996f9e-a5e7-4b66-9fd0-09180ab2e5ec',
    'username': 'ticklishbird156',
    'password': 'womans',
    'salt': 'Pv3DRAze',
    'md5': 'e7504fcfb8d25fafa34544718b22a6c0',
    'sha1': 'a56ce13e6fb646de96ec04ff6896e0d3c4a4bccc',
    'sha256': 'ff99c08bd7320ec76c35e3c23d2dda2033e79080782e30f22722bffa0a1cb090'},
   'dob': {'date': '1976-11-12T12:01:22.143Z', 'age': 46},
   'registered': {'date': '2004-02-24T10:35:32.762Z', 'age': 18},
   'phone': '013873 41079',
   'cell': '07453 390313',
   'id': {'name': 'NINO', 'value': 'Y

Much prettier. Now we can use the JSON keys to get individual items in this output:

In [None]:
response_dict['results'][0]['name']['first']

'Mike'

An excellent choice of first name. Note the way we need to query the object - this is not always obvious and you need to look carefully at the object being returned. In this case the top level offers a choice of two keys, 'results' or 'info' - we want the former. When we look at _response\_dict['results']_ we see it contains a dictionary inside a list (a list with one item). We need to specify the list index 0 to return the inner dictionary. From here we can specify the key 'name', and within the nested dictionary this returns we specify the final key 'first' (i.e. first name).

Every API returns things a bit differently. Ultimately you just need to inspect the output and adjust accordingly.
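
One way to make that inspection less error-prone is to list every path through the nested structure. The sketch below is only illustrative: _list\_paths_ is a hypothetical helper (it is not part of _requests_ or _json_), and _sample_ is a trimmed copy of the response shown above, so no API call is made.

```python
def list_paths(obj, prefix=""):
    """Return the path to every leaf value in a nested dict/list structure."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            paths.extend(list_paths(value, f"{prefix}[{key!r}]"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            paths.extend(list_paths(value, f"{prefix}[{index}]"))
    else:
        # A leaf value (string, number, ...): record the path that reaches it
        paths.append(prefix)
    return paths


# Trimmed copy of the RandomUser response shown earlier
sample = {'results': [{'gender': 'male',
                       'name': {'title': 'Mr', 'first': 'Mike', 'last': 'Ferguson'},
                       'dob': {'date': '1976-11-12T12:01:22.143Z', 'age': 46}}]}

for path in list_paths(sample):
    print("response_dict" + path)
```

Each printed line is an expression that could be pasted straight into a cell, which makes it easier to see why the list index 0 is needed between 'results' and 'name'.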

Let's tidy this up into a reusable function:


In [None]:
def name_generator(url):
  # Call the RandomUser API and parse the JSON body into a dictionary
  r = requests.get(url)
  person = r.json()['results'][0]
  # Keep only the fields we need, wrapped in a one-item list
  new_user = [{'name': person['name']['first'],
               'age': person['dob']['age'],
               'country': person['location']['country']}]

  return new_user

In [None]:
new_user = name_generator(random_user)
new_user

[{'name': 'Melina', 'age': 63, 'country': 'Norway'}]

Now that we can generate a random user, our next task is to create a function to guess their age (using the _Agify.io_ API). This uses the following endpoint:

In [None]:
aging_url = "https://api.agify.io/?name="

However, as you may have spotted, this is not really a full endpoint. The API needs an input (the name) to generate the output (the predicted age), and this input needs to be included in the endpoint.

As the input (name) will vary we will dynamically join this information to the endpoint before making the call like so:

In [None]:
def age_generator(url, user):
  search = url + user[0]['name']
  r = requests.get(search)
  return r

Everything here is the same as before, except for us concatenating (joining) the user name to the end of the URL. I.e. if we pass the url as "https://api.agify.io/?name=" and the user we generated earlier, our joined up endpoint (the _search_ variable) would be "https://api.agify.io/?name=Melina". Let's test it out:

In [None]:
x = age_generator(aging_url, new_user)
x.content

b'{"age":38,"count":8424,"name":"Melina"}'

Excellent - we have our (admittedly pretty inaccurate in this case) guess! However, we also get other data we don't need so let's tidy up the function and get it to return just the predicted age:

In [None]:
def age_generator(url, user):
  search = url + user[0]['name']
  r = requests.get(search)
  response_dict = r.json()

  return response_dict["age"]

x = age_generator(aging_url, new_user)
x

38

Again, this is all as we saw previously, although you may note we have a far easier time extracting the information as there are no lists or nested dictionaries (we need only pass the 'age' key).
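
The string concatenation in _age\_generator_ is fine for simple names, but a name containing a space, an '&' or a non-ASCII character should be percent-encoded before it goes into a URL. The sketch below uses the standard library's _urllib.parse.urlencode_ to show what an encoded endpoint looks like. Note that _build\_url_ is a hypothetical helper introduced only for illustration, and that it takes the base URL without the trailing "?name=".

```python
from urllib.parse import urlencode


def build_url(base_url, name):
    """Attach the name to the base URL as a percent-encoded 'name' query parameter."""
    return base_url + "?" + urlencode({"name": name})


print(build_url("https://api.agify.io/", "Melina"))
print(build_url("https://api.agify.io/", "Saša"))
```

The _requests_ package can do the same encoding for us if the inputs are passed through the _params_ argument, e.g. `requests.get("https://api.agify.io/", params={"name": name})`.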

Let's do the same for our nationality predictor (using the _Nationalize.io_ API):

In [None]:
country_url = "https://api.nationalize.io/?name="

def country_generator(url, user):
  search = url + user[0]['name']
  r = requests.get(search)
  return r

In [None]:
y = country_generator(country_url, new_user)
y.content

b'{"country":[{"country_id":"AR","probability":0.209},{"country_id":"GR","probability":0.123},{"country_id":"CY","probability":0.042},{"country_id":"DE","probability":0.042},{"country_id":"PE","probability":0.038}],"name":"Melina"}'

Again, the Python code is basically the same as before, but we are back with the messy nested dictionaries/list. We actually get five predictions here, but fortunately they are in order of probability. To keep things simple we will just use the prediction with the highest probability (the first one):

In [None]:
def country_generator(url, user):
  search = url + user[0]['name']
  r = requests.get(search)
  response_dict = r.json()

  return response_dict["country"][0]["country_id"]

y = country_generator(country_url, new_user)
y

'AR'

Checking online, 'AR' is the country code for Argentina, which is a very long way from Norway. Bad bot.
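
Taking the first item relies on the API returning its predictions sorted by probability. That is what we observed above, but it is an assumption rather than something this notebook has verified. A slightly more defensive sketch picks the highest-probability entry explicitly. It works on a copy of the response body shown earlier, so it makes no API call, and _top\_country_ is a hypothetical helper name.

```python
import json

# Copy of the Nationalize.io response body shown above
raw = b'{"country":[{"country_id":"AR","probability":0.209},{"country_id":"GR","probability":0.123},{"country_id":"CY","probability":0.042},{"country_id":"DE","probability":0.042},{"country_id":"PE","probability":0.038}],"name":"Melina"}'


def top_country(response_dict):
    """Return the country_id with the highest probability, or None if there are no predictions."""
    predictions = response_dict.get("country", [])
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["probability"])
    return best["country_id"]


response_dict = json.loads(raw)  # json.loads accepts bytes as well as str
print(top_country(response_dict))
```

Returning _None_ for an empty prediction list also avoids the _IndexError_ that indexing with [0] would raise on an empty list.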

With these elements in place we can put it all together with a function that will create a fake user and then guess their age and nationality:

In [None]:
def fake_user_guesser(name_url, aging_url, country_url):
  fake_user = name_generator(name_url)
  age_guess = age_generator(aging_url, fake_user)
  country_guess = country_generator(country_url, fake_user)

  return f"New user is {fake_user[0]['name']}, aged {fake_user[0]['age']} from {fake_user[0]['country']}. We guessed the age was {age_guess} and the country was {country_guess}."

In [None]:
output = fake_user_guesser(random_user, aging_url, country_url)
output

'New user is Sander, aged 36 from Denmark. We guessed the age was 47 and the country was NL.'

And there we go. Our predictions are much improved this time: the age guess is off by 11 years (47 against 36) rather than 25 (38 against 63), and Denmark is much closer to the Netherlands than Norway is to Argentina.

We could obviously think of many improvements to this function - such as actually getting the country name from the code generated or expanding it to compare the other nationality predictions. However, we have met our objectives!
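
As a sketch of the first improvement, a small lookup table can translate the two-letter codes into country names. The table below is hand-written and covers only codes that appear in this notebook. _COUNTRY\_NAMES_ and _country\_name_ are hypothetical names, and a fuller solution would need a complete list of ISO 3166-1 alpha-2 codes.

```python
# Hand-written lookup covering only the codes seen in this notebook
COUNTRY_NAMES = {
    "AR": "Argentina",
    "AU": "Australia",
    "BG": "Bulgaria",
    "BR": "Brazil",
    "CA": "Canada",
    "CN": "China",
    "CY": "Cyprus",
    "DE": "Germany",
    "GR": "Greece",
    "NL": "Netherlands",
    "PE": "Peru",
    "PT": "Portugal",
    "RS": "Serbia",
    "US": "United States",
}


def country_name(code):
    """Return the country name for a code, or the code itself if it is not in the table."""
    return COUNTRY_NAMES.get(code, code)


print(country_name("AR"))
print(country_name("NL"))
print(country_name("ZZ"))  # unknown codes fall back to the code itself
```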

To finish off, let's make a few more calls and see how our predictions fare. However, to avoid stressing the APIs we should leave a little gap between calls. We can use the built-in _time_ module to set a 5 second rest between each:

In [None]:
import time
for i in range(10):
  output = fake_user_guesser(random_user, aging_url, country_url)
  print(output)
  if i != 9:
    print("\n")
  time.sleep(5)

New user is Saša, aged 76 from Serbia. We guessed the age was 35 and the country was RS.


New user is Elisete, aged 70 from Brazil. We guessed the age was 55 and the country was BR.


New user is Brooklyn, aged 31 from New Zealand. We guessed the age was 37 and the country was CA.


New user is Claudilene, aged 71 from Brazil. We guessed the age was 53 and the country was BR.


New user is Simeon, aged 55 from Serbia. We guessed the age was 58 and the country was BG.


New user is Aubree, aged 35 from United States. We guessed the age was 40 and the country was US.


New user is Bill, aged 60 from Australia. We guessed the age was 73 and the country was US.


New user is Summer, aged 49 from New Zealand. We guessed the age was 25 and the country was CN.


New user is Elsa, aged 68 from Mexico. We guessed the age was 56 and the country was PT.


New user is Gabrielle, aged 76 from Canada. We guessed the age was 49 and the country was AU.


Overall a decent set of guesses, particularly for users from Brazil or the US!
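
To put a rough number on "decent", the ten results printed above can be tallied. The sketch below copies the values from that output by hand (nothing is fetched from the APIs) and reports the mean absolute age error and how many country guesses matched. The actual country codes are our own reading of the country names, not something returned by the APIs.

```python
# (actual age, guessed age, actual country code, guessed country code)
results = [
    (76, 35, "RS", "RS"),  # Saša
    (70, 55, "BR", "BR"),  # Elisete
    (31, 37, "NZ", "CA"),  # Brooklyn
    (71, 53, "BR", "BR"),  # Claudilene
    (55, 58, "RS", "BG"),  # Simeon
    (35, 40, "US", "US"),  # Aubree
    (60, 73, "AU", "US"),  # Bill
    (49, 25, "NZ", "CN"),  # Summer
    (68, 56, "MX", "PT"),  # Elsa
    (76, 49, "CA", "AU"),  # Gabrielle
]

age_errors = [abs(actual - guess) for actual, guess, _, _ in results]
mean_age_error = sum(age_errors) / len(age_errors)
country_hits = sum(1 for _, _, actual, guess in results if actual == guess)

print(f"Mean absolute age error: {mean_age_error:.1f} years")
print(f"Correct country guesses: {country_hits} of {len(results)}")
```

On this small sample the age guesses are off by about 16 years on average and four of the ten countries match, including every user from Brazil and the US.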

## Additional Tasks
While our basic functionality seems to work, there are no measures in place to deal with an API failing ... which, given enough use, would be a reasonable expectation. Can you rewrite the code to deal with this potential issue? _Hint: the requests package will allow you to check the status of the response using \<variable\>.status\_code_. Good luck!
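
If you would like a starting point, here is one possible shape for such a check; it is a sketch under stated assumptions rather than a complete answer. _fetch\_json_ and _FakeResponse_ are hypothetical names. The function takes the HTTP call as an argument (_get_) so that it can be tried out with a stand-in response and no network connection; in the notebook you would pass _requests.get_.

```python
def fetch_json(url, get):
    """Call get(url) and return the parsed JSON body, or None if the call failed.

    `get` is any function that behaves like requests.get, i.e. it returns an
    object with a `status_code` attribute and a `json()` method.
    """
    try:
        response = get(url)
    except Exception as error:  # e.g. a network problem raised by the HTTP library
        print(f"Request to {url} failed: {error}")
        return None

    if response.status_code != 200:
        print(f"Request to {url} returned status {response.status_code}")
        return None

    return response.json()


# Stand-in response so the sketch runs without a network connection
class FakeResponse:
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


url = "https://api.agify.io/?name=Melina"
ok = fetch_json(url, lambda u: FakeResponse(200, {"age": 38, "count": 8424, "name": "Melina"}))
failed = fetch_json(url, lambda u: FakeResponse(500, {}))  # 500 is just an example failure status
print(ok)
print(failed)
```

The calling code then needs to decide what to do when it receives _None_, for example skipping that user or retrying after a pause.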