# Data Acquisition

- request / response
- HTTP: the plain-text protocol used to transport requests and responses
- HTML: document structure (compilation target for markdown)
- JSON: data interchange format based on JavaScript
- API: how things are interacted with programmatically
- REST: a prescription for application URLs

RESTful URLs:

| HTTP Method | Endpoint         | Description                |
| ---         | ---              | ---                        |
| GET         | /{resource}/{id} | Read details of a resource |
| GET         | /{resource}      | A listing of resources     |
| POST        | /{resource}      | Create a new resource      |
| PATCH       | /{resource}/{id} | Update a resource          |
| DELETE      | /{resource}/{id} | Delete a resource          |

We'll focus on the GET methods, since they are the ones that retrieve information for us to read.
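As a rough illustration of the table above, the patterns can be filled in for one concrete resource. This is only a sketch: `ROUTES` and `describe_routes` are names made up for the example, and the `stores` resource with id `1` is borrowed from the store API used later in these notes.

```python
# Hypothetical helper: fill the RESTful URL patterns from the table
# for one concrete resource. Nothing here touches the network.
ROUTES = [
    ("GET", "/{resource}/{id}", "Read details of a resource"),
    ("GET", "/{resource}", "A listing of resources"),
    ("POST", "/{resource}", "Create a new resource"),
    ("PATCH", "/{resource}/{id}", "Update a resource"),
    ("DELETE", "/{resource}/{id}", "Delete a resource"),
]


def describe_routes(resource, resource_id):
    """Return (method, path, description) tuples for one resource."""
    return [
        (method, pattern.format(resource=resource, id=resource_id), description)
        for method, pattern, description in ROUTES
    ]


for method, path, description in describe_routes("stores", 1):
    print(f"{method:<6} {path:<10} {description}")
```

Only the two GET rows matter for data acquisition; the others are listed to show how the same path means different things under different methods.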

In [2]:
import pandas as pd

In [1]:
# requests will allow us to interact with the web via python
import requests

In [6]:
some_urls = {
    'example': 'https://www.example.com',
    'swapi': 'https://swapi.dev/api/',
    'store': 'https://python.zgulde.net/api/v1'
}

In [3]:
# let's get some information from the urls above:

In [4]:
# the primary use of "get" in this case for us will be
# to grab the information from our requested domains

In [7]:
some_urls['example']

'https://www.example.com'

In [8]:
requests.get(some_urls['example'])

<Response [200]>

What we see above is a response of 200, a status code indicating that the request was successful.

Let's store the response in a variable and see what else we can do with it.

In [10]:
response = requests.get(some_urls['example'])
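The code shown in `<Response [200]>` is also available as an integer on `response.status_code`, and `response.ok` is `True` for codes below 400. To read what a code means without memorizing the list, the standard library's `http.HTTPStatus` can help. The helper name `describe_status` below is made up for this sketch, and it works on plain integers so it runs without a network connection.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    family = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }[code // 100]
    return f"{code} {status.phrase} ({family})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With a real response, the same helper could be called as `describe_status(response.status_code)`.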

In [12]:
response.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [13]:
# let's examine another domain:
response = requests.get(some_urls['swapi'])

In [14]:
response

<Response [200]>

In [15]:
response.text

'{"people":"https://swapi.dev/api/people/","planets":"https://swapi.dev/api/planets/","films":"https://swapi.dev/api/films/","species":"https://swapi.dev/api/species/","vehicles":"https://swapi.dev/api/vehicles/","starships":"https://swapi.dev/api/starships/"}'

In [16]:
response.json()

{'people': 'https://swapi.dev/api/people/',
 'planets': 'https://swapi.dev/api/planets/',
 'films': 'https://swapi.dev/api/films/',
 'species': 'https://swapi.dev/api/species/',
 'vehicles': 'https://swapi.dev/api/vehicles/',
 'starships': 'https://swapi.dev/api/starships/'}

In [18]:
requests.get('http://www.example.com').json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

 - Calling `.json()` fails here because the body of this response is HTML, which is meant to be rendered by a browser for human readers, not JSON.
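One way to guard against this, sketched with only the standard library: try to parse the body and fall back to `None` when it is not JSON. The helper name `parse_json_or_none` is made up for this example, and the two sample strings are shortened versions of the bodies shown above.

```python
import json


def parse_json_or_none(text):
    """Parse text as JSON, returning None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


html_body = "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>"
json_body = '{"people":"https://swapi.dev/api/people/"}'

print(parse_json_or_none(html_body))
print(parse_json_or_none(json_body))
```

With `requests`, the same check can be applied to `response.text`. The `Content-Type` entry in `response.headers` is another hint about what kind of body came back.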

In [20]:
some_urls['swapi']

'https://swapi.dev/api/'

In [19]:
response.json()

{'people': 'https://swapi.dev/api/people/',
 'planets': 'https://swapi.dev/api/planets/',
 'films': 'https://swapi.dev/api/films/',
 'species': 'https://swapi.dev/api/species/',
 'vehicles': 'https://swapi.dev/api/vehicles/',
 'starships': 'https://swapi.dev/api/starships/'}

In [21]:
# seeing what we've been served, let's navigate to
# people:
# the base url already ends with a slash, so we add people/1/ to it

In [23]:
requests.get(some_urls['swapi'] + 'people/1/').json()

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'https://swapi.dev/api/planets/1/',
 'films': ['https://swapi.dev/api/films/1/',
  'https://swapi.dev/api/films/2/',
  'https://swapi.dev/api/films/3/',
  'https://swapi.dev/api/films/6/'],
 'species': [],
 'vehicles': ['https://swapi.dev/api/vehicles/14/',
  'https://swapi.dev/api/vehicles/30/'],
 'starships': ['https://swapi.dev/api/starships/12/',
  'https://swapi.dev/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'https://swapi.dev/api/people/1/'}
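Gluing a base URL and a path together by hand makes it easy to end up with a doubled or missing slash, since some base URLs here end in `/` and some paths start with one. A small helper can normalize that. `join_url` is a hypothetical name for this sketch, not part of `requests`.

```python
def join_url(base, *parts):
    """Join a base URL and path pieces with exactly one slash between them."""
    pieces = [base.rstrip("/")]
    pieces += [str(part).strip("/") for part in parts]
    return "/".join(piece for piece in pieces if piece)


# The swapi base ends with a slash; the store base does not.
print(join_url("https://swapi.dev/api/", "/people/", 1))
print(join_url("https://python.zgulde.net/api/v1", "/stores"))
```

The helper drops any trailing slash, so add one back if an API expects it. `urllib.parse.urljoin` is not a drop-in replacement here: it resolves paths the way a browser does, so a path that starts with `/` replaces the `/api/v1` part of the base.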

In [24]:
# let's look at another data source:

In [25]:
response = requests.get(some_urls['store'])

In [None]:
# our url:

In [27]:
some_urls['store']

'https://python.zgulde.net/api/v1'

In [28]:
# our content:

In [26]:
response.text

'{"payload":{"routes":["/stores","/stores/{store_id}","/items","/items/{item_id}","/sales","/sales/{sale_id}"]},"status":"ok"}\n'

In [29]:
type(response.json())

dict

In [None]:
# let's turn that into a dictionary

In [35]:
response.json()

{'payload': {'routes': ['/stores',
   '/stores/{store_id}',
   '/items',
   '/items/{item_id}',
   '/sales',
   '/sales/{sale_id}']},
 'status': 'ok'}

In [34]:
stores_endpoint = response.json()['payload']['routes'][0]

In [36]:
# let's utilize the endpoint to get the stores page data
stores = requests.get(some_urls['store'] + stores_endpoint)

In [37]:
stores.json()

{'payload': {'max_page': 1,
  'next_page': None,
  'page': 1,
  'previous_page': None,
  'stores': [{'store_address': '12125 Alamo Ranch Pkwy',
    'store_city': 'San Antonio',
    'store_id': 1,
    'store_state': 'TX',
    'store_zipcode': '78253'},
   {'store_address': '9255 FM 471 West',
    'store_city': 'San Antonio',
    'store_id': 2,
    'store_state': 'TX',
    'store_zipcode': '78251'},
   {'store_address': '2118 Fredericksburg Rdj',
    'store_city': 'San Antonio',
    'store_id': 3,
    'store_state': 'TX',
    'store_zipcode': '78201'},
   {'store_address': '516 S Flores St',
    'store_city': 'San Antonio',
    'store_id': 4,
    'store_state': 'TX',
    'store_zipcode': '78204'},
   {'store_address': '1520 Austin Hwy',
    'store_city': 'San Antonio',
    'store_id': 5,
    'store_state': 'TX',
    'store_zipcode': '78218'},
   {'store_address': '1015 S WW White Rd',
    'store_city': 'San Antonio',
    'store_id': 6,
    'store_state': 'TX',
    'store_zipcode': '78220

In [38]:
# we know this is a dictionary as we rendered the json
type(stores.json())

dict

In [40]:
# at the base level I appear to have a payload and a status
stores.json().keys()

dict_keys(['payload', 'status'])

In [41]:
# what is the status?
stores.json()['status']

'ok'

In [43]:
type(stores.json()['payload'])

dict

In [44]:
stores.json()['payload'].keys()

dict_keys(['max_page', 'next_page', 'page', 'previous_page', 'stores'])

 - What have we observed so far?
 - We examined the base URL, then navigated to the stores endpoint.
 - The stores endpoint returned a payload and a status.
 - The status was simply 'ok' and told us nothing else of value.
 - The payload held both the data (the stores) and the information needed to navigate the rest of the data (the page-related keys).
 - The stores themselves are a list of dictionaries that we can easily cast into a pandas DataFrame.

In [48]:
stores_payload = stores.json()['payload']

In [49]:
# we have not only the stores themselves but
# also navigation information
stores_payload.keys()

dict_keys(['max_page', 'next_page', 'page', 'previous_page', 'stores'])

In [52]:
stores_payload['max_page']

1

In [54]:
# is there nothing here?
stores_payload['next_page']

In [55]:
type(stores_payload['next_page'])

NoneType

In [47]:
pd.DataFrame(stores.json()['payload']['stores'])

|     | store_address           | store_city  | store_id | store_state | store_zipcode |
| --- | ---                     | ---         | ---      | ---         | ---           |
| 0   | 12125 Alamo Ranch Pkwy  | San Antonio | 1        | TX          | 78253         |
| 1   | 9255 FM 471 West        | San Antonio | 2        | TX          | 78251         |
| 2   | 2118 Fredericksburg Rdj | San Antonio | 3        | TX          | 78201         |
| 3   | 516 S Flores St         | San Antonio | 4        | TX          | 78204         |
| 4   | 1520 Austin Hwy         | San Antonio | 5        | TX          | 78218         |
| 5   | 1015 S WW White Rd      | San Antonio | 6        | TX          | 78220         |
| 6   | 12018 Perrin Beitel Rd  | San Antonio | 7        | TX          | 78217         |
| 7   | 15000 San Pedro Ave     | San Antonio | 8        | TX          | 78232         |
| 8   | 735 SW Military Dr      | San Antonio | 9        | TX          | 78221         |
| 9   | 8503 NW Military Hwy    | San Antonio | 10       | TX          | 78231         |


In [57]:
stores.json()['payload']

{'max_page': 1,
 'next_page': None,
 'page': 1,
 'previous_page': None,
 'stores': [{'store_address': '12125 Alamo Ranch Pkwy',
   'store_city': 'San Antonio',
   'store_id': 1,
   'store_state': 'TX',
   'store_zipcode': '78253'},
  {'store_address': '9255 FM 471 West',
   'store_city': 'San Antonio',
   'store_id': 2,
   'store_state': 'TX',
   'store_zipcode': '78251'},
  {'store_address': '2118 Fredericksburg Rdj',
   'store_city': 'San Antonio',
   'store_id': 3,
   'store_state': 'TX',
   'store_zipcode': '78201'},
  {'store_address': '516 S Flores St',
   'store_city': 'San Antonio',
   'store_id': 4,
   'store_state': 'TX',
   'store_zipcode': '78204'},
  {'store_address': '1520 Austin Hwy',
   'store_city': 'San Antonio',
   'store_id': 5,
   'store_state': 'TX',
   'store_zipcode': '78218'},
  {'store_address': '1015 S WW White Rd',
   'store_city': 'San Antonio',
   'store_id': 6,
   'store_state': 'TX',
   'store_zipcode': '78220'},
  {'store_address': '12018 Perrin Beitel 

In [58]:
# just like I saw with the Star Wars API, if I have a number
# of people to iterate through, I merely need to find out how many
# people there are, or how to know when to terminate my loop

In [60]:
# instead of casting that as a dataframe, 
# just use it as a list of dictionaries
my_stores_initial = stores.json()['payload']['stores']

In [61]:
my_stores_initial

[{'store_address': '12125 Alamo Ranch Pkwy',
  'store_city': 'San Antonio',
  'store_id': 1,
  'store_state': 'TX',
  'store_zipcode': '78253'},
 {'store_address': '9255 FM 471 West',
  'store_city': 'San Antonio',
  'store_id': 2,
  'store_state': 'TX',
  'store_zipcode': '78251'},
 {'store_address': '2118 Fredericksburg Rdj',
  'store_city': 'San Antonio',
  'store_id': 3,
  'store_state': 'TX',
  'store_zipcode': '78201'},
 {'store_address': '516 S Flores St',
  'store_city': 'San Antonio',
  'store_id': 4,
  'store_state': 'TX',
  'store_zipcode': '78204'},
 {'store_address': '1520 Austin Hwy',
  'store_city': 'San Antonio',
  'store_id': 5,
  'store_state': 'TX',
  'store_zipcode': '78218'},
 {'store_address': '1015 S WW White Rd',
  'store_city': 'San Antonio',
  'store_id': 6,
  'store_state': 'TX',
  'store_zipcode': '78220'},
 {'store_address': '12018 Perrin Beitel Rd',
  'store_city': 'San Antonio',
  'store_id': 7,
  'store_state': 'TX',
  'store_zipcode': '78217'},
 {'store

In [62]:
# here we have: 
#  a max page
#  a next page

In [None]:
# my_stores_initial.append(########)

## Guidance for the exercise

1. Setup
    - url (base + endpoint)
    - empty list
1. Loop
    1. make a request
    1. handle the response, add to the list
    1. find the next url endpoint
        1. if it's None, stop looping
        1. if it's a string, use it to construct the next url
1. Turn the list into a dataframe
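The steps above can be sketched as a loop. This is one possible shape, not the required solution: `fetch_all`, `get_json`, and the fake two-page API are invented for the example so that it runs without a network connection, and it assumes that `next_page` is either `None` or a path to append to the site's domain.

```python
def fetch_all(domain, first_path, key, get_json):
    """Collect every record from a paginated endpoint.

    get_json is any callable that takes a URL and returns the parsed
    JSON body, for example: lambda url: requests.get(url).json()
    """
    records = []                      # setup: empty list
    path = first_path                 # setup: first url is domain + path
    while path is not None:           # stop when there is no next page
        body = get_json(domain + path)
        payload = body["payload"]
        records.extend(payload[key])  # handle the response, add to the list
        print(f"page {payload['page']} of {payload['max_page']}: "
              f"{len(records)} records so far")
        path = payload["next_page"]   # None on the last page
    return records


# A made-up, in-memory stand-in for a two-page API (illustration only).
FAKE_PAGES = {
    "https://example.test/api/v1/items": {
        "payload": {"page": 1, "max_page": 2,
                    "next_page": "/api/v1/items?page=2",
                    "items": [{"item_id": 1}, {"item_id": 2}]},
        "status": "ok",
    },
    "https://example.test/api/v1/items?page=2": {
        "payload": {"page": 2, "max_page": 2,
                    "next_page": None,
                    "items": [{"item_id": 3}]},
        "status": "ok",
    },
}

items = fetch_all("https://example.test", "/api/v1/items", "items", FAKE_PAGES.get)
print(items)
```

Against the real API, `get_json` could be `lambda url: requests.get(url).json()`, and the resulting list can be handed to `pd.DataFrame` as the last step. Check what `next_page` actually contains before relying on the assumption made here.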

General Tips

- solve an easy problem first (the items endpoint), then apply that solution to the larger problem (sales)
- informational print statements are helpful as you are developing code, especially inside of a loop to see what changes
- Don't be afraid to interrupt the kernel: command + shift + p (command + shift + c for jupyter lab), then "interrupt the kernel"
- The curriculum says https://python.zgulde.net; that will work, or use https://api.data.codeup.com