# Tutorial 07-02 Transforming and Loading Data

## Repeat Extract Logic from Tutorial 07-01

In the previous tutorial, we extracted data from an API and converted it into a pandas DataFrame.  Now we can add some transformations and assemble the full workflow.  

While we're copying our code over this time, let's take an opportunity to make our code a bit more reusable.  In the earlier chapters of this book, we brought up the concept of functional programming.  Now we can put that concept into practice.  


#### 1.  Create a function.
The one big thing we can abstract away is the extraction of data and conversion to a DataFrame.  Nothing in that logic was specific to our use case (aside from the URL we formatted).  Getting a response from the URL, converting it to Python data types, and then converting those to a DataFrame is repeatable for other URLs.  Now you'll turn that logic into a function you can use here and reuse elsewhere.

In [None]:
def extract_to_df(url):
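    '''
    Request JSON data from a URL and return it as a pandas DataFrame.

    url: the full request URL, including any query parameters.
    '''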

    # Request data from the API
    response = requests.get(url)

    # Get the response body as JSON-formatted text
    response_text = response.text

    # Convert the JSON text to Python data types (lists/dictionaries)
    results_list = json.loads(response_text)

    # Convert the Python lists/dictionaries to a pandas DataFrame
    df = pandas.DataFrame(results_list)

    # Return the DataFrame
    return df


The majority of the remaining logic was specific to our use case and might not be worth abstracting into a function at this point.  The repeatable functionality you turned into a function will still make your code a bit cleaner.  Notice that the last line of the consolidated code block in the next step calls the function.
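
To see the conversion steps inside `extract_to_df()` without calling the API, here is a minimal sketch that runs the same JSON-to-DataFrame steps on a hard-coded string.  The sample payload is made up for illustration; a real response has many more fields.

```python
import json

import pandas

# Hypothetical sample payload shaped like a list of JSON records
sample_text = (
    '[{"service_request_id": "1", "service_name": "Graffiti"}, '
    '{"service_request_id": "2", "service_name": "Street and Sidewalk Cleaning"}]'
)

# JSON text -> Python list of dictionaries
records = json.loads(sample_text)

# List of dictionaries -> DataFrame (one row per record, one column per key)
sample_df = pandas.DataFrame(records)

print(sample_df.shape)
print(list(sample_df.columns))
```

Each dictionary becomes a row and each key becomes a column, which is why the function works for any URL that returns a list of flat JSON records.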

#### 2.  Reformat the code from the previous exercise
Now you'll consolidate and clean up the code from the previous exercise.  You can go back through all the cells and copy the code over individually, or you can use the code block below.  Either way, the goal is to start with your import statements and end with a DataFrame that you can work with further.

In [None]:
import pandas
import requests
import json
import datetime
import arcgis

# base url for the data
url = "https://data.sfgov.org/resource/vw6y-z8j6.json"

# build the field list
field_list = [
    'service_request_id',
    'requested_datetime',
    'status_notes',
    'service_name',
    'service_subtype',
    'lat',
    'long',
    'neighborhoods_sffind_boundaries',
    'source',
    'supervisor_district',
    'media_url',
    'point'
]
fields = "$select=" + ",".join(field_list)

# build the where statement
now = datetime.datetime.now()
start_date = now - datetime.timedelta(days=7)

# Set the time to midnight
midnight = datetime.time()

# Combine the date and time
start_day_midnight = datetime.datetime.combine(start_date, midnight)
start_day_string = start_day_midnight.strftime('%Y-%m-%dT%H:%M:%S.%f') 
where = f"$where=status_description='Open' and requested_datetime > '{start_day_string}'"

# set a record limit
limit = "$limit=10000"

# combine the url components
full_url = url + "?" + fields + "&" + where + "&" + limit

# call our function to extract and convert to a DataFrame
df = extract_to_df(full_url)

In [None]:
print(df.shape)
df.head()
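
The date window and query string built in this section can also be wrapped in small helpers.  The sketch below is one possible approach, not part of the tutorial code: `midnight_days_ago()` and `build_query()` are hypothetical names, the `example.com` URL is a placeholder, and percent-encoding with `urllib.parse.urlencode` is a choice made here for illustration.

```python
import datetime
from urllib.parse import quote, urlencode


def midnight_days_ago(now, days):
    """Return midnight on the date that falls `days` days before `now`."""
    start_date = now - datetime.timedelta(days=days)
    # datetime.time() with no arguments is 00:00:00
    return datetime.datetime.combine(start_date.date(), datetime.time())


def build_query(base_url, field_list, where, limit):
    """Join the query components into one URL, percent-encoding the values."""
    params = {
        "$select": ",".join(field_list),
        "$where": where,
        "$limit": limit,
    }
    # Keep "$" and "," readable; spaces, quotes, and operators are encoded
    return base_url + "?" + urlencode(params, safe="$,", quote_via=quote)


# A fixed timestamp keeps the example repeatable
now = datetime.datetime(2024, 3, 15, 13, 45, 30)
start = midnight_days_ago(now, 7)
start_string = start.strftime("%Y-%m-%dT%H:%M:%S")

where = f"status_description='Open' and requested_datetime > '{start_string}'"
print(start_string)
print(build_query("https://example.com/resource.json", ["lat", "long"], where, 10000))
```

Passing `now` in as an argument, rather than calling `datetime.datetime.now()` inside the helper, makes the date logic easy to check against a known value.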

## Condition the Data

#### 1.  Write a lambda function.
Now you'll get your data ready to summarize.  If you want to summarize counts of records with specific service name values, you can add new columns to the raw data to summarize.  You'll use a common pattern that combines a pandas method with a core Python concept called a *lambda function*.

Lambda functions are short, one-line anonymous functions that let you express relatively simple logic without writing a full `def` block.  Take the following regular function as an example.

In [None]:
def street_cleaning(value):
    if value == 'Street and Sidewalk Cleaning':
        return 1
    else:
        return 0
    
print(street_cleaning('Street and Sidewalk Cleaning'))
print(street_cleaning('some other string'))

You've defined a function that returns a 1 if it receives the specific string "Street and Sidewalk Cleaning".  If it receives anything else, it returns a 0.  This is going to be really useful for summary purposes later.  You'll have to do this multiple times though, so writing it in a more concise way would be helpful.  This is where the lambda function comes in.

In [None]:
lambda_func = lambda x: 1 if x == 'Street and Sidewalk Cleaning' else 0

print(lambda_func('Street and Sidewalk Cleaning'))
print(lambda_func('some other string'))

#### 2.  Apply that lambda function to a column of data.
This gave you the same values as the more verbose function we defined previously, but was defined all in one line.  Now you can combine that with a pandas method called **apply** that applies your function to all the values in a column (or Series in pandas terminology).

In [None]:
df['service_name'].apply(
    lambda r: 1 if r == 'Street and Sidewalk Cleaning' else 0
)

Note that this returns a new column (a pandas Series) rather than changing the DataFrame.  You can now save this result as a column in your DataFrame.  You can also do the same with the other string values you're interested in.
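
As a quick check of what `apply` returns, the sketch below runs the same kind of lambda on a small made-up Series and compares it with a vectorized comparison, which is an alternative way to build the same 1/0 values.  The sample values are assumptions for illustration.

```python
import pandas

# Made-up service names standing in for df['service_name']
names = pandas.Series(['Graffiti', 'Street and Sidewalk Cleaning', 'Graffiti'])

# apply() calls the lambda once per value and returns a new Series
via_apply = names.apply(lambda r: 1 if r == 'Graffiti' else 0)

# Alternative: compare the whole Series at once, then convert True/False to 1/0
via_compare = (names == 'Graffiti').astype(int)

print(via_apply.tolist())
print(via_apply.tolist() == via_compare.tolist())
```

The original Series is left unchanged in both cases; the result only becomes part of a DataFrame when you assign it to a column.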

#### 3.  Create new columns using lambda functions and the apply method
Now you'll use that same methodology to create multiple new columns that you'll use for summary purposes later.  If you're summarizing data by groups, sometimes it's also handy to have a column that just has a 1 for each row.

In [None]:
# street/sidewalk cleaning yes/no
df['street_sidewalk_cleaning'] = df['service_name'].apply(
    lambda r: 1 if r == 'Street and Sidewalk Cleaning' else 0
)

# graffiti yes/no
df['graffiti'] = df['service_name'].apply(
    lambda r: 1 if r == 'Graffiti' else 0
)

# counter column for total cases
df['total_cases'] = 1

In [None]:
df[['service_name','street_sidewalk_cleaning','graffiti','total_cases']]
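
If the list of service names grows, the same flag columns can be generated in a loop.  This is a sketch on a toy DataFrame, not the tutorial's code: the `flag_columns` dictionary is a hypothetical helper, and 'Tree Maintenance' is a made-up value used only to show a non-matching row.

```python
import pandas

# Toy data standing in for the real service requests
toy = pandas.DataFrame({
    'service_name': ['Graffiti', 'Street and Sidewalk Cleaning', 'Tree Maintenance']
})

# Hypothetical mapping of new column name -> service name to flag
flag_columns = {
    'street_sidewalk_cleaning': 'Street and Sidewalk Cleaning',
    'graffiti': 'Graffiti',
}

for column, service in flag_columns.items():
    # Bind the current service name as a default so each lambda keeps its own value
    toy[column] = toy['service_name'].apply(
        lambda r, service=service: 1 if r == service else 0
    )

# counter column for total cases
toy['total_cases'] = 1

print(toy)
```

With only two flags, writing each column out by hand, as in the cell above, is just as readable; the loop mainly helps when there are many categories.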

## Summarize By Neighborhood

Now that you've pulled the raw data directly from the source and added summary columns, you can continue the transformation by summarizing the data by neighborhood.  This will allow you to join it with the spatial data going forward.

#### 1.  Use Pandas to group and summarize the data
Pandas DataFrames have a built-in method called `groupby()`.  This accepts a column (or list of columns) to group by.  It returns a new object that isn't super useful on its own.  You'll need to use an aggregation method on that new object.  There are many, but you can use `.agg()` in this case so that you can specify which columns you want to summarize.
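
To see what `groupby()` followed by `.agg()` produces, here is a small sketch on made-up data.  The neighborhood labels 'A' and 'B' and the column name `neighborhood` are placeholders for illustration.

```python
import pandas

# Toy rows: two cases in neighborhood A, one in neighborhood B
toy = pandas.DataFrame({
    'neighborhood': ['A', 'A', 'B'],
    'total_cases': [1, 1, 1],
    'graffiti': [1, 0, 1],
})

# Group the rows by neighborhood, then sum the chosen columns within each group
summary = toy.groupby('neighborhood').agg(
    {
        'total_cases': 'sum',
        'graffiti': 'sum'
    }
)
print(summary)

# The grouping column becomes the index of the result
print(summary.index.name)

# reset_index() turns it back into a regular column if you need one
flat = summary.reset_index()
print(list(flat.columns))
```

The same pattern is applied to the real data in the next cell, with `neighborhoods_sffind_boundaries` as the grouping column.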

In [None]:
df_neighborhood = df.groupby("neighborhoods_sffind_boundaries").agg(
    {
        "total_cases": "sum",
        "street_sidewalk_cleaning": "sum",
        "graffiti": "sum"
    }
)
df_neighborhood

## Read Spatial Data and Merge with Summary Data

#### 1.  Read spatial data from a feature class.
Now you'll create a Spatially Enabled DataFrame using neighborhood data.  This uses the `spatial` accessor that comes with the `arcgis` package.  You'll use the `from_featureclass()` method to read a local feature class.

In [None]:
sedf_neighborhoods = pandas.DataFrame.spatial.from_featureclass(
    "../Chapter 06/Tutorial_06_02.gdb/SF_Find_Neighborhoods"
)

#### 2.  Merge non-spatial summary with spatial data.
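You're going to use a pandas method called **merge** here.  If you need to review this topic there's a more in-depth discussion in the **Data Manipulation** chapter, but it's basically pandas' version of a table join.  You're going to join your geometry with your summarized data and create a new DataFrame.  Because `groupby()` made `neighborhoods_sffind_boundaries` the index of `df_neighborhood`, `right_on` refers to that index level by name here, which `merge` accepts alongside column names.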

In [None]:
sedf_merge = sedf_neighborhoods.merge(
    df_neighborhood, 
    how = 'inner', 
    left_on = 'name', 
    right_on = 'neighborhoods_sffind_boundaries'
)
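
An inner join keeps only the rows whose keys appear on both sides, so neighborhoods with no matching summary row drop out silently.  One way to check for that, sketched here on made-up tables rather than the real data, is to run an outer merge with `indicator=True` and look at the rows that did not match.

```python
import pandas

# Toy stand-ins: neighborhood shapes on the left, summarized cases on the right
shapes = pandas.DataFrame({'name': ['A', 'B', 'C']})
summary = pandas.DataFrame({
    'neighborhoods_sffind_boundaries': ['A', 'B', 'D'],
    'total_cases': [5, 2, 1],
})

# The inner join keeps only names found in both tables
inner = shapes.merge(
    summary,
    how='inner',
    left_on='name',
    right_on='neighborhoods_sffind_boundaries'
)
print(len(inner))

# The outer join keeps everything; the _merge column says where each row came from
check = shapes.merge(
    summary,
    how='outer',
    left_on='name',
    right_on='neighborhoods_sffind_boundaries',
    indicator=True
)
unmatched = check[check['_merge'] != 'both']
print(unmatched)
```

Rows marked `left_only` are neighborhoods with no cases in the summary, and rows marked `right_only` are summary names that did not match any neighborhood, which often points to a spelling difference between the two sources.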

## Write the Data to an Output

#### 1.  Write joined data to a feature class.
The "Load" portion of an ETL process can take several forms.  Here you'll overwrite the output feature class; other options are discussed after the exercise.

In [None]:
sedf_merge.spatial.to_featureclass(
    "../Chapter 06/Tutorial_06_02.gdb/sf_311_cases_prev_7_days"
)