For this tutorial, we're going to begin the first stage of a data pipeline.  We're going to be working with the same data source that we used for our previous two chapters' tutorials, but this time we're going straight to the source.  We will pull data directly from San Francisco's Open Data API.

#### 1. Import Packages

Let's start by importing the Python packages that we'll need to extract our data.  We should be pretty familiar with pandas at this point (but if you need a refresher, check out Chapter 6 - Data Manipulation).  There are a couple of new packages we're working with here though.  Let's go through them one by one.
- **requests** - This is a package designed to interact with web APIs.  We're going to use it to retrieve data from an open data portal.
- **json** - This package is super helpful for converting long JSON strings to Python data types like strings, numbers, dictionaries, and lists.
- **datetime** - This package is going to help us work with (surprise) dates and times.  We're going to use it to help choose which data we extract from our API.

In [1]:
import pandas
import requests
import json
import datetime

#### 2. Get the date for yesterday

Now let's use our Python skills and a new package to get yesterday's date.  This is a really common pattern in Python automation workflows.  Rather than hard-code or input a date, we can use the **datetime** package to find out what the current date and time are when our process is being executed.  Then we can use the current date to find relative dates (in our case, yesterday).

Let's start by finding today's date.

In [4]:
now = datetime.datetime.now()
now

datetime.datetime(2024, 1, 21, 8, 15, 55, 498535)

The resulting datetime object is made up of all the components of the date and time at which the object was created.  In order, it lists the year, month, day, hour, minute, second, and microsecond.

Now let's use a **timedelta** object to do some math and subtract a day.  That will give us yesterday's date.

In [6]:
yesterday = now - datetime.timedelta(days=1)
yesterday

datetime.datetime(2024, 1, 20, 8, 15, 55, 498535)

You might notice that in the resulting datetime object, the day has changed, but the hour, minute, second, and microsecond are still the same.  That might be fine for some use-cases, but for our use-case we'd like to get data for the entire previous day.  Let's set the time to midnight instead.

We'll start by creating a **time** object with the default time (which happens to be midnight).  Then we'll combine the date and time to get an object that represents midnight yesterday.

In [8]:
# Set the time to midnight
midnight = datetime.time()

# Combine the date and time
yesterday_midnight = datetime.datetime.combine(yesterday, midnight)

yesterday_midnight

datetime.datetime(2024, 1, 20, 0, 0)

Note that the hour and minute are now set to 0, and the seconds and microseconds are left out of the output because they are zero.
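
The steps in this section can be folded into one small helper.  This is only a minimal sketch: the function name `midnight_yesterday` and its optional `now` argument are illustrative choices, not something the rest of this tutorial relies on.  Passing `now` explicitly makes the result reproducible, which is handy for testing.

```python
import datetime


def midnight_yesterday(now=None):
    """Return a datetime for 00:00:00 on the day before `now` (hypothetical helper)."""
    # Fall back to the current local time when no reference time is supplied.
    if now is None:
        now = datetime.datetime.now()
    # Step back one day, then swap the time portion for midnight.
    yesterday = now - datetime.timedelta(days=1)
    return datetime.datetime.combine(yesterday.date(), datetime.time())


# Reuse the timestamp from the example output above.
reference = datetime.datetime(2024, 1, 21, 8, 15, 55, 498535)
print(midnight_yesterday(reference))  # 2024-01-20 00:00:00
```

Calling `midnight_yesterday()` with no argument uses the time at which the process runs, which matches the automation pattern described in this step.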

#### 3. Format Our Request

One of the wonderful things about working with an API is that we can control some aspects of our data as we're ingesting it.  Every API is different, so it's worth reviewing the API documentation, which in this case is readily available at [this link](https://dev.socrata.com/foundry/data.sfgov.org/vw6y-z8j6).  The parameters we'll use for our query are documented at [this page](https://dev.socrata.com/docs/queries/).

Let's think about this in terms of the anatomy of a SQL statement.
```
SELECT
    service_request_id,
    requested_datetime,
    status_notes,
    lat,
    long,
    neighborhoods_sffind_boundaries,
    source,
    supervisor_district,
    media_url,
    point
FROM
    DATA_SOURCE
WHERE
    status_description = 'Open' AND requested_datetime > CURRENT_DAY - 1

```



If we start from the top, the first thing we see is a list of fields.  According to the documentation, we can format that as a comma-separated list and pass it along as part of the URL we request.

In [2]:
field_list = [
    'service_request_id',
    'requested_datetime',
    'status_notes',
    'lat',
    'long',
    'neighborhoods_sffind_boundaries',
    'source',
    'supervisor_district',
    'media_url',
    'point'
]

fields = "$select=" + ",".join(field_list)
fields

'$select=service_request_id,requested_datetime,status_notes,lat,long,neighborhoods_sffind_boundaries,source,supervisor_district,media_url,point'

TIP - the `.join` method is a method of a string object in Python.  When you pass a list of strings into that method, it uses the string as a separator between the items and returns them all as a single string.
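
Here is a small illustration of how `.join` behaves.  The lists below are made-up sample values chosen for brevity, not data from the API.

```python
# The separator goes between items, never before the first or after the last.
sample_fields = ['lat', 'long', 'source']
comma_joined = ",".join(sample_fields)
print(comma_joined)  # lat,long,source

# join only accepts strings, so other types need converting first.
districts = [2, 6, 9]
district_text = ", ".join(str(d) for d in districts)
print(district_text)  # 2, 6, 9
```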

Moving down through the SQL query, the next component we see is the `FROM` clause.  That's pretty straightforward in our API.  It's just going to be the root URL of our dataset.

In [3]:
url = "https://data.sfgov.org/resource/vw6y-z8j6.json"

The next thing we notice moving down the SQL statement is the `WHERE` clause.  The API documentation shows a where clause that we can format similarly.  The first portion, where we restrict the records returned to only records where `status_description = 'Open'` is pretty straightforward, but the latter portion involving the `CURRENT_DAY` isn't.  That's where our datetime work from Step 2 comes into play.

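We can format our datetime object into a string format that this API can use.  In this case, the API can read datetimes formatted like *YYYY-MM-DDTHH:MM:SS*, with a literal `T` separating the date from the time.  We'll use a method called **strftime** to return a string in this format; the format string we use also appends the microseconds with `%f`.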

In [10]:
formatted_yesterday = yesterday_midnight.strftime('%Y-%m-%dT%H:%M:%S.%f') 
formatted_yesterday

'2024-01-20T00:00:00.000000'
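
To make the format codes less mysterious, here is a short sketch that spells them out and compares the result with the built-in `isoformat` method.  The `sample` value is hard-coded for illustration.  Whether the API accepts the shorter, seconds-only form is an assumption worth checking against its documentation.

```python
import datetime

sample = datetime.datetime(2024, 1, 20, 0, 0)

# %Y = 4-digit year, %m = month, %d = day,
# %H = hour (24-hour clock), %M = minute, %S = second, %f = microsecond
with_micro = sample.strftime('%Y-%m-%dT%H:%M:%S.%f')
print(with_micro)  # 2024-01-20T00:00:00.000000

# isoformat() produces the same ISO 8601 layout without a format string.
iso_seconds = sample.isoformat(timespec='seconds')
print(iso_seconds)  # 2024-01-20T00:00:00
```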

Now let's put that together with the rest of our `WHERE` clause.

In [12]:
where = f"$where=status_description='Open' and requested_datetime > '{formatted_yesterday}'"
where

"$where=status_description='Open' and requested_datetime > '2024-01-20T00:00:00.000000'"

Now we can put the components of our request all together in one URL.

In [14]:
full_url = url+"?"+fields+"&"+where
full_url

"https://data.sfgov.org/resource/vw6y-z8j6.json?$select=service_request_id,requested_datetime,status_notes,lat,long,neighborhoods_sffind_boundaries,source,supervisor_district,media_url,point&$where=status_description='Open' and requested_datetime > '2024-01-20T00:00:00.000000'"

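Notice that the URL above contains spaces, quotes, and a `>` character.  Browsers and the **requests** package generally encode these for us, but it can be safer to encode them explicitly.  The sketch below swaps plain string concatenation for `urllib.parse.urlencode` from the standard library.  The shortened field list and the `params` dictionary are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://data.sfgov.org/resource/vw6y-z8j6.json"

# Hypothetical, shortened set of query parameters for illustration.
params = {
    "$select": "service_request_id,requested_datetime",
    "$where": "status_description='Open' and requested_datetime > '2024-01-20T00:00:00.000000'",
}

# urlencode percent-encodes reserved characters and replaces spaces with '+'.
encoded_url = base_url + "?" + urlencode(params)
print(encoded_url)

# Decoding the query string recovers the original parameter values.
decoded = parse_qs(urlsplit(encoded_url).query)
print(decoded["$where"][0])
```

The **requests** package can also do this encoding itself when given a dictionary, as in `requests.get(base_url, params=params)`.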
#### 4. Make a request to the API

Now that we've got our URL formatted, we can actually make our request to the API.  Note that if you were to take that URL represented by the `full_url` variable and put it into a browser, you'd get a response from the API.  We're going to use the **requests** package to do the same thing with Python.

In [16]:
response = requests.get(full_url)
response

<Response [200]>

Note that we get a **response** object back, and its printed representation shows its `status_code` property.  These codes tell us whether our request was successful.  Two of the most common codes are 200 (successful) and 404 (not found).  [This link](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) contains a fairly comprehensive list of codes.

As long as the response code is in the 200s, we're good to go.  In this case, we're expecting a response containing JSON data.  We'll access the `text` property to retrieve that data.
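
The "in the 200s" rule can be written down as a quick check.  This is a minimal sketch using only the standard library; the helper name `describe_status` is hypothetical and is not part of the **requests** package.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    # HTTPStatus knows the standard phrase for registered codes.
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown"
    # Codes from 200 through 299 indicate success.
    outcome = "success" if 200 <= status_code < 300 else "not successful"
    return f"{status_code} {phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not successful)
```

In practice, a **requests** response object also offers `response.ok` and `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes, so a pipeline can stop early instead of parsing an error page.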

In [21]:
response_text = response.text

Let's take a look at our results.  Before printing the entire response, we should probably see what the type of the response is and how long it is.  Then we can take a look at the first 1000 characters of the data to get an idea of what it looks like.

In [22]:
print(type(response_text))
print(len(response_text))
response_text[0:1000]

<class 'str'>
279299


'[{"service_request_id":"17791220","requested_datetime":"2024-01-20T00:04:55.000","status_notes":"open","lat":"37.79961","long":"-122.436035","neighborhoods_sffind_boundaries":"Union Street","source":"Web","supervisor_district":"2","point":{"latitude":"37.79961","longitude":"-122.436035","human_address":"{\\"address\\": \\"\\", \\"city\\": \\"\\", \\"state\\": \\"\\", \\"zip\\": \\"\\"}"}}\n,{"service_request_id":"17791233","requested_datetime":"2024-01-20T00:13:00.000","status_notes":"open","lat":"37.805177813017","long":"-122.438171046278","neighborhoods_sffind_boundaries":"Marina","source":"Mobile/Open311","supervisor_district":"2","point":{"latitude":"37.80517781","longitude":"-122.43817105","human_address":"{\\"address\\": \\"\\", \\"city\\": \\"\\", \\"state\\": \\"\\", \\"zip\\": \\"\\"}"}}\n,{"service_request_id":"17791244","requested_datetime":"2024-01-20T00:28:00.000","status_notes":"open","lat":"37.801254608387","long":"-122.440450640663","neighborhoods_sffind_boundaries":"M

This looks like what we would expect for JSON data.  We can see it starts with a square bracket and then a curly bracket.  This tells us it's likely an array of objects.  The problem is that all that information is organized as a single string.  We'll need to change that into Python data types like lists and dictionaries to effectively work with it further.

#### 5. Format the Results of the Request

This is where our **json** package comes into play.  The json package can parse strings (with `loads`) or file-like objects (with `load`) and turn our data into useful Python data types.  In this case, since our input is a string, we'll call the **loads** function.

In [24]:
results_list = json.loads(response_text)
results_list[0:2]

[{'service_request_id': '17791220',
  'requested_datetime': '2024-01-20T00:04:55.000',
  'status_notes': 'open',
  'lat': '37.79961',
  'long': '-122.436035',
  'neighborhoods_sffind_boundaries': 'Union Street',
  'source': 'Web',
  'supervisor_district': '2',
  'point': {'latitude': '37.79961',
   'longitude': '-122.436035',
   'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}},
 {'service_request_id': '17791233',
  'requested_datetime': '2024-01-20T00:13:00.000',
  'status_notes': 'open',
  'lat': '37.805177813017',
  'long': '-122.438171046278',
  'neighborhoods_sffind_boundaries': 'Marina',
  'source': 'Mobile/Open311',
  'supervisor_district': '2',
  'point': {'latitude': '37.80517781',
   'longitude': '-122.43817105',
   'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}}]

Now our data is formatted nicely into recognizable and accessible Python data structures.  The array of features is represented as a list.  Each feature in that array is represented by a dictionary.  Now we can index into the list and access key/value pairs for each field in a feature.
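
As a small, self-contained illustration of indexing into parsed JSON, the sketch below parses a trimmed-down record modeled on the response above.  The `sample_text` string is shortened by hand for readability.  Two things stand out: every value arrives as a string, and `human_address` is itself a JSON string that needs a second `loads` call.

```python
import json

# A hand-trimmed record modeled on the API response (raw string keeps the backslashes).
sample_text = r'[{"service_request_id":"17791220","lat":"37.79961","point":{"latitude":"37.79961","human_address":"{\"address\": \"\", \"city\": \"\"}"}}]'

records = json.loads(sample_text)
first = records[0]
print(type(records).__name__, type(first).__name__)  # list dict

# Nested objects become nested dictionaries.
print(first["point"]["latitude"])  # 37.79961

# Values arrive as strings, so convert them before doing arithmetic.
latitude = float(first["lat"])

# human_address is a JSON string inside the JSON, so parse it again.
address = json.loads(first["point"]["human_address"])
print(address)  # {'address': '', 'city': ''}
```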

That's useful, but it'd be way easier to manipulate that data if it was in a table.  So let's create a DataFrame out of the data.

In [26]:
df = pandas.DataFrame(results_list)
df.head()

| | service_request_id | requested_datetime | status_notes | lat | long | neighborhoods_sffind_boundaries | source | supervisor_district | point | media_url |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17791220 | 2024-01-20T00:04:55.000 | open | 37.79961 | -122.436035 | Union Street | Web | 2 | {'latitude': '37.79961', 'longitude': '-122.43... | |
| 1 | 17791233 | 2024-01-20T00:13:00.000 | open | 37.805177813017 | -122.438171046278 | Marina | Mobile/Open311 | 2 | {'latitude': '37.80517781', 'longitude': '-122... | |
| 2 | 17791244 | 2024-01-20T00:28:00.000 | open | 37.801254608387 | -122.440450640663 | Marina | Mobile/Open311 | 2 | {'latitude': '37.80125461', 'longitude': '-122... | |
| 3 | 17791271 | 2024-01-20T01:24:00.000 | open | 0.0 | 0.0 | South of Market | Integrated Agency | 6 | {'latitude': '0.0', 'longitude': '0.0', 'human... | |
| 4 | 17791272 | 2024-01-20T01:24:00.000 | open | 0.0 | 0.0 | South of Market | Integrated Agency | 6 | {'latitude': '0.0', 'longitude': '0.0', 'human... | |
