# Lab 7: Turning Queries into Functions

## First, let's set up the database engine like we did last week

### Load DB credentials

In [None]:
import json

# TODO: make sure to download credentials from https://canvas.upenn.edu/files/89654914/download?download_frd=1
# save them to the base directory for this repo
with open("pg-credentials.json") as creds:
    creds = json.load(creds)

PASSWORD = creds["PASSWORD"]
HOST = creds["HOST"]
USERNAME = creds["USERNAME"]
DATABASE = creds["DATABASE"]
PORT = creds["PORT"]

### Create DB engine

In [None]:
from sqlalchemy import create_engine

engine = create_engine(f"postgresql://{USERNAME}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}")

In [None]:
# make sure it works

engine.execute("SELECT 'Hello'").fetchone()

## Review from Lecture

* Query templates
* Putting a query template into a function for reusable code

## 0. Review: Query Templating

### Avoid SQL injection

![](https://imgs.xkcd.com/comics/exploits_of_a_mom.png)

### Let's see how SQLAlchemy templates our queries

**Valid Inputs**

In [None]:
from sqlalchemy.sql import text

q = text("SELECT name, totaldocks, docksavailable FROM indego_station_status LIMIT :num")

In [None]:
str(q.bindparams(num=2).compile(bind=engine, compile_kwargs={"literal_binds": True}))

In [None]:
engine.execute(q, num=2).fetchall()

**Invalid Inputs**

In [None]:
str(q.bindparams(num='2; select * from andys_cookies;').compile(bind=engine, compile_kwargs={"literal_binds": True}))

In [None]:
engine.execute(q, num='2; select * from andys_cookies;').fetchall()

Good news :) SQLAlchemy saved us!

### If we templated the string using Python string functions...

In [None]:
qtext = "SELECT * FROM indego_station_status LIMIT {num}"

num = '2; SELECT * FROM andys_cookies;'
print(qtext.format(num=num))

**Uh oh**. Notice that a second query was 'injected' into our templated query, because the input was pasted in as raw SQL instead of being quoted as a value.

Let's execute it to see what happens...

In [None]:
engine.execute(qtext.format(num=num)).fetchall()

My cookie table was hacked!

### Aside... creating a table from nothing

We'll discuss operations like this in the coming weeks, but I created that cookie table with this query:

```SQL
CREATE TABLE andys_cookies AS
SELECT cookie_type, quantity 
FROM (
	VALUES ('peanut butter', 10), 
	       ('pecan', 20),
	       ('chocolate fudge', 5)
) AS c(cookie_type, quantity)
```
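To try the difference between bound parameters and string formatting without touching the class database, here is a minimal sketch using an in-memory SQLite database from Python's standard library. SQLite is only a stand-in for Postgres here, and the `cookies` table name, the `WHERE` filter, and the malicious input are illustrative assumptions rather than part of the lab data.

```python
import sqlite3

# In-memory stand-in database (assumption: SQLite instead of the lab's Postgres)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cookies (cookie_type TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO cookies VALUES (?, ?)",
    [("peanut butter", 10), ("pecan", 20), ("chocolate fudge", 5)],
)

# Input crafted to change the meaning of the WHERE clause
user_input = "pecan' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and parsed as SQL
unsafe_sql = f"SELECT cookie_type FROM cookies WHERE cookie_type = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is bound as a value and only ever compared as a string
safe_sql = "SELECT cookie_type FROM cookies WHERE cookie_type = :name"
safe_rows = conn.execute(safe_sql, {"name": user_input}).fetchall()

print(len(unsafe_rows))  # the injected OR clause matches every row
print(len(safe_rows))    # no cookie has that literal name
```

The formatted query returns the whole table because the injected `OR '1'='1'` is always true, while the bound version finds nothing because no cookie is literally named `pecan' OR '1'='1`.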

## 1. More Templating: Asking Questions

In [None]:
# NOTE: the dataset originally had capitals in the names, so we need to quote the column names here
def fetch_five_vacant_buildings():
    query = text("""
        SELECT "ADDRESS", "BLDG_DESC", "ZIPCODE", "BUILD_RANK"
        FROM vacant_buildings
        LIMIT 5
    """)
    return engine.execute(query).fetchall()

In [None]:
fetch_five_vacant_buildings()

### 1.1 What are the five closest vacant buildings to Meyerson Hall?

Meyerson Hall has a lat/lng of `(39.952263,-75.1927827)`

In [None]:
def vacants_close_to_meyerson_hall(num_buildings=5):
    query = text("""
        -- enter your query here
    """)
    return engine.execute(query).fetchall()
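Once your query returns results, it can help to sanity-check the distances independently of the database. The sketch below uses the haversine formula in plain Python. It assumes a spherical Earth, so expect small differences from distances computed on a spheroid, and the building coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


meyerson = (39.952263, -75.1927827)
# Hypothetical building location, for illustration only
building = (39.9550, -75.1900)
print(round(haversine_m(*meyerson, *building)))
```

If the distances your query reports are wildly different from this estimate, a likely culprit is swapped lat/lng order or a distance computed in degrees rather than meters.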

### 1.2 What are the largest vacant buildings by zip code?

In [None]:
# fill in your code here

## 2. Give all vacant buildings in a neighborhood

### 2.1 Data

We have a neighborhood table:

In [None]:
resp = engine.execute("SELECT neighborhood_name, ST_AsText(geom)  FROM philadelphia_neighborhoods LIMIT 1").fetchall()
resp

### 2.2 Build a function that takes a neighborhood name and returns all vacant buildings in it

In [None]:
def vacant_buildings_by_neighborhood(name):
    # write your function here
    pass

### 2.3 Let's Validate Inputs!

Validating inputs helps guide users if they make a mistake.

In [None]:
def is_valid_neighborhood_name(input_name):
    query = text("""
        SELECT neighborhood_name 
        FROM philadelphia_neighborhoods
        WHERE neighborhood_name = :input_name
    """)

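    resp = engine.execute(query, input_name=input_name)
    # rowcount is not guaranteed to be meaningful for SELECT statements,
    # so check whether at least one row comes back instead
    return resp.fetchone() is not None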

In [None]:
is_valid_neighborhood_name("Andy")

In [None]:
is_valid_neighborhood_name("Strawberry Mansion")

In [None]:
def get_vacant_buildings(neighborhood_name):
    if not is_valid_neighborhood_name(neighborhood_name):
        raise ValueError(f"'{neighborhood_name}' is not a valid neighborhood name")
    result = vacant_buildings_by_neighborhood(neighborhood_name)
    return result

In [None]:
get_vacant_buildings("Andy")

### 2.4 But what are the valid names? Let's print them in the error message too.

Write a function that returns the names of all neighborhoods.

In [None]:
def list_neighborhood_names():
    """Retrieve all neighborhood names, return as a list"""
    query = text("""
    --- put your query here
    """)
    # place your code here

Return should look like: 
```
['ACADEMY_GARDENS',
 'AIRPORT',
 'ALLEGHENY_WEST',
 'ANDORRA',
 'ASTON_WOODBRIDGE',
 'BARTRAM_VILLAGE',
 ...
```

### Now we can use the results of the list function to give users some options

In [None]:
def get_vacant_buildings(neighborhood_name):
    if not is_valid_neighborhood_name(neighborhood_name):
        neighborhood_list = list_neighborhood_names()
        raise ValueError(f"'{neighborhood_name}' is not a valid neighborhood name. Choose one of {neighborhood_list}")
    return vacant_buildings_by_neighborhood(neighborhood_name)

In [None]:
get_vacant_buildings("Andy")
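Printing every neighborhood makes for a long error message. As an alternative, the sketch below suggests only the closest matches using `difflib` from the standard library. It assumes names are stored in upper case as in the expected output above, and `suggest_neighborhoods` is a hypothetical helper, not part of the lab solution.

```python
from difflib import get_close_matches


def suggest_neighborhoods(name, valid_names, max_suggestions=3):
    """Return up to max_suggestions valid names that resemble the input."""
    # Assumption: names in the table are upper case, so normalize before comparing
    return get_close_matches(name.upper(), valid_names, n=max_suggestions, cutoff=0.6)


# Sample of names copied from the expected output above
sample_names = [
    "ACADEMY_GARDENS",
    "AIRPORT",
    "ALLEGHENY_WEST",
    "ANDORRA",
    "ASTON_WOODBRIDGE",
    "BARTRAM_VILLAGE",
]

print(suggest_neighborhoods("academy gardens", sample_names))
print(suggest_neighborhoods("zzzz", sample_names))
```

In practice you would pass the result of `list_neighborhood_names()` as `valid_names` and fall back to the full list when no suggestion is found.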

## 3. Fetching data from BigQuery

In [None]:
from google.cloud import bigquery
import geopandas as gpd
from shapely import wkt

# NOTE: you need to set up a service account (or use another auth method)
bqclient = bigquery.Client.from_service_account_json("MUSA-509-3337814ad805.json")

In [None]:
query = f"""
SELECT (select value from unnest(all_tags) WHERE key = 'amenity') as amenity_type,
       COUNT(*) as num_amenities
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
 WHERE 'amenity' IN (SELECT key FROM UNNEST(all_tags))
 AND ST_INTERSECTSBOX(ST_Centroid(geometry), -75.280298,39.867005,-74.955831,40.137959)
GROUP BY 1
ORDER BY 2 DESC
"""
response = bqclient.query(query)

# print the rows
for row in response:
    print(row['amenity_type'].ljust(17), row['num_amenities'])

* [Parameterize queries](https://cloud.google.com/bigquery/docs/parameterized-queries) to avoid SQL Injection

BigQuery uses `@variable_name` notation for templating/parametrizing literals (strings, numbers, but not tables) in queries.

It makes use of the `QueryJobConfig` object in Python: <https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJobConfig.html>
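Because table names cannot be passed as `@` parameters, a table chosen at runtime has to be formatted into the query text. One common way to keep that safe is to look the name up in an allowlist you control, so user input never reaches the SQL directly. The sketch below is an assumption about how this could look: `ALLOWED_TABLES` and `amenity_count_query` are hypothetical names, and the resulting string would still be run with a `QueryJobConfig` that supplies `@poi_category`.

```python
# Hypothetical allowlist: short keys mapped to fully qualified table names
ALLOWED_TABLES = {
    "features": "bigquery-public-data.geo_openstreetmap.planet_features",
}


def amenity_count_query(table_key):
    """Build a query against an allowlisted table; values still use @parameters."""
    if table_key not in ALLOWED_TABLES:
        options = ", ".join(sorted(ALLOWED_TABLES))
        raise ValueError(f"`{table_key}` is not an allowed table. Try one of {options}")
    table = ALLOWED_TABLES[table_key]
    # Only the allowlisted name is formatted in; the amenity value stays a parameter
    return f"""
        SELECT COUNT(*) AS num_amenities
          FROM `{table}`
         WHERE ('amenity', @poi_category) IN (SELECT (key, value) FROM UNNEST(all_tags))
    """


print(amenity_count_query("features"))
```

Anything the user types is only ever used as a dictionary key, so an unexpected value raises a `ValueError` instead of ending up in the query.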

In [None]:
def get_nearest_cafes(lng, lat, distance, amenity_type="cafe"):
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("poi_category", "STRING", amenity_type),
            bigquery.ScalarQueryParameter("lng", "FLOAT", lng),
            bigquery.ScalarQueryParameter("lat", "FLOAT", lat),
            bigquery.ScalarQueryParameter("distance", "FLOAT", distance)
        ]
    )
    query = f"""
        SELECT (select value from unnest(all_tags) WHERE key = 'name') as amenity_name, 
               (select value from unnest(all_tags) WHERE key = 'amenity') as amenity_type,
               (select value from unnest(all_tags) WHERE key = 'addr:street') as address,
               (select value from unnest(all_tags) WHERE key = 'phone') as phone_number,
               CAST(round(ST_Distance(ST_GeogPoint(@lng, @lat), ST_Centroid(geometry))) AS int64) as distance_away_meters,
               geometry
          FROM `bigquery-public-data.geo_openstreetmap.planet_features`
         WHERE ('amenity', @poi_category) IN (SELECT (key, value) FROM UNNEST(all_tags))
         and ST_DWithin(ST_GeogPoint(@lng, @lat), ST_Centroid(geometry), @distance)
         ORDER BY distance_away_meters ASC
    """
    response = bqclient.query(query, job_config=job_config)
    return response

In [None]:
meyerson_lnglat = (-75.1927795, 39.9522139)
response = get_nearest_cafes(meyerson_lnglat[0], meyerson_lnglat[1], 1000, 'cafe')

In [None]:
for row in response:
    description = f"{row['amenity_name']} is {row['distance_away_meters']} meters away"
    if row['address'] is not None:
        description = description + f" on {row['address']}"
    print(description + '\n')

In [None]:
from cartoframes.viz import Layer

cafes = gpd.GeoDataFrame(response.to_dataframe(), geometry=[wkt.loads(row.geometry).centroid for row in response], crs="epsg:4326")

Layer(cafes)

### Add input validation

In [None]:
query = f"""
SELECT DISTINCT (select value from unnest(all_tags) WHERE key = 'amenity') as amenity_type
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
 WHERE 'amenity' IN (SELECT key FROM UNNEST(all_tags))
 AND ST_INTERSECTSBOX(ST_Centroid(geometry), -75.280298,39.867005,-74.955831,40.137959)
"""
response = bqclient.query(query)

In [None]:
poi_valid_set = set([row['amenity_type'] for row in response])
poi_valid_set

In [None]:
def validate_poi_input(category):
    if category not in poi_valid_set:
        raise ValueError(f"`{category}` is not a valid entry. Try one of {', '.join(poi_valid_set)}")

In [None]:
validate_poi_input('hi')

In [None]:
def get_nearest_cafes(lng, lat, distance, amenity_type="cafe"):
    validate_poi_input(amenity_type)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("poi_category", "STRING", amenity_type),
            bigquery.ScalarQueryParameter("lng", "FLOAT", lng),
            bigquery.ScalarQueryParameter("lat", "FLOAT", lat),
            bigquery.ScalarQueryParameter("distance", "FLOAT", distance)
        ]
    )
    query = f"""
        SELECT (select value from unnest(all_tags) WHERE key = 'name') as amenity_name, 
               (select value from unnest(all_tags) WHERE key = 'amenity') as amenity_type,
               (select value from unnest(all_tags) WHERE key = 'addr:street') as address,
               (select value from unnest(all_tags) WHERE key = 'phone') as phone_number,
               CAST(round(ST_Distance(ST_GeogPoint(@lng, @lat), ST_Centroid(geometry))) AS int64) as distance_away_meters,
               geometry
          FROM `bigquery-public-data.geo_openstreetmap.planet_features`
         WHERE ('amenity', @poi_category) IN (SELECT (key, value) FROM UNNEST(all_tags))
         and ST_DWithin(ST_GeogPoint(@lng, @lat), ST_Centroid(geometry), @distance)
         ORDER BY distance_away_meters ASC
    """
    response = bqclient.query(query, job_config=job_config)
    return response

In [None]:
get_nearest_cafes(meyerson_lnglat[0], meyerson_lnglat[1], 1000, 'bicycle_repair_station').to_dataframe()

## OpenStreetMap Editing

Are you interested in using OSM for your project? There are many ways to get OSM data, including the semi-yearly updates on BigQuery. [GeoFabrik](https://download.geofabrik.de/) also publishes daily extracts for regions of the world, but the shapefiles can be large and hard to trim down to your region of interest.

### Is OSM lacking in a region you want? Start adding your house, your parents' house, etc.

<https://www.openstreetmap.org/#map=17/39.95484/-75.20505>