# Identify areas of high/low property taxes

Giulia Carella

SDSC21 Workshop - Unlocking Spatial Analytics in the cloud with CARTO

## Context
Hotspot analysis identifies statistically significant hot spots and cold spots, that is, spatial clusters of unusually high or low values, using spatial statistics such as the Getis-Ord Gi* statistic.
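To make the statistic concrete, here is a minimal NumPy sketch of the textbook Gi* formula on a toy one-dimensional row of cells. The function name `getis_ord_gi_star`, the toy values, and the binary "self plus adjacent cells" weights are illustrative assumptions; this is not the implementation used by the CARTO Analytics Toolbox.

```python
import numpy as np

def getis_ord_gi_star(values, weights):
    """Textbook Getis-Ord Gi* for every cell.

    values:  1-D array with one value per cell.
    weights: (n, n) matrix; weights[i, j] is the weight of cell j in the
             neighbourhood of cell i (the cell itself is included).
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    mean = x.mean()
    # Global standard deviation (population form, as in the Gi* definition).
    s = np.sqrt((x ** 2).mean() - mean ** 2)
    w_sum = w.sum(axis=1)
    w_sq_sum = (w ** 2).sum(axis=1)
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Toy example: high values on the left, low values on the right.
values = np.array([10, 9, 8, 1, 1, 1, 1, 1], dtype=float)
idx = np.arange(values.size)
# Binary weights: each cell plus its immediate left/right neighbours.
weights = (np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float)

gi = getis_ord_gi_star(values, weights)
print(np.round(gi, 2))
```

Cells surrounded by high values get a positive Gi*, cells surrounded by low values a negative one. Gi* behaves like a z-score, so its magnitude indicates how unusual the local neighbourhood is compared with the whole study area.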

## What we will do
In this demo, we will look at real estate data in Providence, Rhode Island, USA. Specifically, we will aggregate the property data over a regular grid and compute the average property tax per population in each cell, which is then used as the input of the hotspot analysis.

We will:

- Explore the CARTO data catalog.
- Enrich the (point) property data with data from CARTO Spatial Features, including population, elevation, urbanity, and climatological data.
- Compute the average property tax normalized by the population at the grid cell level using a regular quadkey grid.
- Identify areas of high/low property taxes per population using the Getis-Ord Gi* statistic.
- Create a visualization with the results.

## Conclusions

Using CARTO, we were able to explore and analyse 2018 property tax data in Providence, Rhode Island, USA.
We enriched this data with population counts from CARTO's Spatial Features, computed the average property tax normalized by the population at the grid cell level, and identified areas of high/low property taxes. By visualizing the results, we found two main clusters, corresponding respectively to areas of lower and higher property taxes per population, located in distinct parts of the city.

## 0. Set up

In this step we will connect to Google BigQuery using OAuth authentication.

### 0.a Import libraries

In [1]:
from google.cloud import bigquery
import pandas_gbq
import pydata_google_auth
import pandas as pd

### 0.b Set Google BigQuery credentials

In [2]:
SCOPES = [
    'https://www.googleapis.com/auth/cloud-platform'
]

credentials = pydata_google_auth.get_user_credentials(
    SCOPES,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as over SSH or with Google Colab.
    auth_local_webserver=False)

In [3]:
client_bq = bigquery.Client(credentials=credentials, project="XXXX")

### 0.c Specify the Google BigQuery project and table where the data are stored

In [4]:
project = 'cartobq'
dataset = 'demos_sdsc21'
table_name = 'property_sales_providence_2018_geo'
logger = f"{project}.{dataset}.{table_name}"

## 1. Enrich with CARTO Spatial Features data

In this step we will enrich the real estate data with data from CARTO Spatial Features, which includes population, elevation, urbanity, and climatological data. Here, we will use the public version of this dataset, which is available on a quadkey grid at zoom level 15 (see https://docs.carto.com/analytics-toolbox-bq/overview/getting-started/).
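For readers unfamiliar with quadkey grids, the sketch below shows how a longitude/latitude pair maps to a Web Mercator tile and then to a quadkey string at a given zoom level, following the widely used Bing Maps tile scheme. The helper names are hypothetical, and the `geoid` values in CARTO's tables may use a different encoding of the same tiles, so treat this as an illustration of the grid and not as CARTO's indexing function.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator projection

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a point at the given zoom level."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 2 ** zoom  # number of tiles per axis
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or the latitude limits stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def lonlat_to_quadkey(lon, lat, zoom):
    x, y = lonlat_to_tile(lon, lat, zoom)
    return tile_to_quadkey(x, y, zoom)

# Approximate coordinates of downtown Providence (illustrative values).
providence_quadkey = lonlat_to_quadkey(-71.4128, 41.8240, 15)
print(providence_quadkey)
```

Each additional zoom level splits a tile into four children, so a zoom-15 quadkey has 15 digits and all points with the same quadkey fall in the same grid cell.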

### 1a. Subscribe to CARTO's Spatial Features
Go to your CARTO dashboard (https://carto.com/login) and select Data -> Your Subscriptions -> New Subscription -> Search for _Spatial Features - United States of America (Quadgrid 15)_ -> Subscribe for free.

To find the name of the table in Google BigQuery where the subscribed data are stored, go to QUICK ACTIONS -> Access in BigQuery.

### 1b. Explore the catalog 

In [5]:
q = f'''
    CALL `carto-un`.data.DATAOBS_SUBSCRIPTIONS('do-team.giulia_carto','')
'''
enrich_tables = client_bq.query(q).to_dataframe()
enrich_tables.head()

| | dataset_slug | dataset_name | dataset_country | dataset_category | dataset_license | dataset_provider | dataset_version | dataset_geo_type | table | associated_geography |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cdb_spatial_fea_bd4173ae | Spatial Features - United States of America (Q... | United States of America | Derived | Public data | CARTO | v2 | POLYGON | view_carto_derived_spatialfeatures_usa_quadgri... | carto-do-public-data.carto.geography_usa_quadg... |


#### Table names

In [6]:
enrich_tables.table.to_list()

['view_carto_derived_spatialfeatures_usa_quadgrid15_v1_yearly_v2']

In [7]:
table_name_do = 'view_carto_derived_spatialfeatures_usa_quadgrid15_v1_yearly_v2'
# Use .iloc[0] so the lookup works regardless of the row's index label.
table_name_do_geom = enrich_tables[enrich_tables.table == table_name_do].associated_geography.iloc[0]

#### Variable names

In [8]:
q = f'''
    CALL `carto-un`.data.DATAOBS_SUBSCRIPTION_VARIABLES('do-team.giulia_carto','')
'''
enrich_vars = client_bq.query(q).to_dataframe()
enrich_vars.head()

| | variable_slug | variable_name | variable_description | variable_type | variable_aggregation | dataset_slug | geography_slug |
|---|---|---|---|---|---|---|---|
| 0 | country_iso_85762e15 | country_iso | Country name, following the ISO 3166 standard | STRING | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 1 | country_iso_a3_f31b3754 | country_iso_a3 | Three-letter country code, following the ISO 3... | STRING | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 2 | do_date_7008e06c | do_date | First day of the year of the corresponding data | DATE | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 3 | education_55373808 | education | Number of education related POIs, incuding sch... | INTEGER | SUM | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 4 | elevation_b2172c5b | elevation | Average elevation based on ASTER Global Digita... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |


In [10]:
enrich_vars[enrich_vars.dataset_slug == 'cdb_spatial_fea_bd4173ae']

| | variable_slug | variable_name | variable_description | variable_type | variable_aggregation | dataset_slug | geography_slug |
|---|---|---|---|---|---|---|---|
| 0 | country_iso_85762e15 | country_iso | Country name, following the ISO 3166 standard | STRING | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 1 | country_iso_a3_f31b3754 | country_iso_a3 | Three-letter country code, following the ISO 3... | STRING | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 2 | do_date_7008e06c | do_date | First day of the year of the corresponding data | DATE | | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 3 | education_55373808 | education | Number of education related POIs, incuding sch... | INTEGER | SUM | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 4 | elevation_b2172c5b | elevation | Average elevation based on ASTER Global Digita... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 132 | wind_mar_11d0df9d | wind_mar | Monthly wind speed (m s-1) aggregated across a... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 133 | wind_may_86020615 | wind_may | Monthly wind speed (m s-1) aggregated across a... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 134 | wind_nov_8a788853 | wind_nov | Monthly wind speed (m s-1) aggregated across a... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |
| 135 | wind_oct_c901cc44 | wind_oct | Monthly wind speed (m s-1) aggregated across a... | FLOAT | AVG | cdb_spatial_fea_bd4173ae | cdb_quadgrid15_417f4a13 |


### 1c. Use CARTO Analytics Toolbox to enrich the property data with CARTO's Spatial Features

In this step we will use the function ENRICH_GRID to associate each data point with the corresponding grid cell in CARTO Spatial Features.

**data.ENRICH_GRID**(grid_type, input_query, input_index_column, data_query, data_geography_column, variables, output)

https://docs.carto.com/analytics-toolbox-bq/sql-reference/data/#enrich_grid

- grid_type: h3, quadkey, s2 or geohash.
- input_query: query to select the data to be enriched (Standard SQL).
- input_index_column: name of a column in the query that contains the grid indices.
- data_query:  query used to enrich the cells provided in the input query (Standard SQL).
- data_geography_column: name of the geography column provided in the data_query.
- variables:  columns that will be used to enrich the input and their corresponding aggregation method.
- output: name of an output table to store the results.

As a result of this process, each input grid cell will be enriched with the data of the enrichment query that spatially intersects it. When the input cell intersects with more than one feature of the enrichment query, the data is aggregated using the aggregation methods specified.

Valid aggregation methods are: _SUM_ (extensive properties), _MIN_ (intensive property), _MAX_ (intensive property), _AVG_ (area-weighted average, intensive property), _COUNT_. For other types of aggregation, the ENRICH_GRID_RAW procedure can be used to obtain non-aggregated data that can be later applied to any desired custom aggregation.
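As an illustration of what the `avg` aggregation does for point data, the pandas sketch below averages made-up property records per grid cell. It assumes that each point falls in exactly one cell, so that the average reduces to a plain per-cell mean; the cell identifiers and values are invented, and the `_avg` suffix mirrors the output column names used later in this notebook (`total_assessment_avg`, `total_taxes_avg`).

```python
import pandas as pd

# Made-up property records, already tagged with the grid cell they fall in.
points = pd.DataFrame({
    "geoid": ["cell_a", "cell_a", "cell_b"],
    "total_assessment": [100000.0, 200000.0, 400000.0],
    "total_taxes": [1000.0, 3000.0, 5000.0],
})

# One row per cell, with the mean of each variable over the cell's points.
per_cell = (
    points.groupby("geoid")[["total_assessment", "total_taxes"]]
    .mean()
    .add_suffix("_avg")
    .reset_index()
)
print(per_cell)
```

Cells that contain no property record have nothing to average, which is consistent with the `WHERE total_assessment_avg IS NOT NULL` filter applied in the query of this step.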

In [13]:
selected_enrich_vars = ['education',
              'financial',
              'food_drink',
              'healthcare',
              'leisure',
              'population',
              'retail',
              'tourism',
              'transportation']
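The query in the next cell interpolates this Python list directly into the SQL text (`IN UNNEST({selected_enrich_vars})`). This works because the default string form of a list of plain strings happens to look like a SQL array literal. The sketch below shows that rendering together with a slightly more defensive variant; `to_sql_string_array` is a hypothetical helper, not part of any library, and it assumes that the variable names are simple identifiers.

```python
import re

# Only letters, digits and underscores, which is enough for the column names used here.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def to_sql_string_array(names):
    """Render a list of column names as a SQL array literal of strings."""
    for name in names:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unexpected column name: {name!r}")
    return "[" + ", ".join(f"'{name}'" for name in names) + "]"

example_vars = ["education", "population"]

# What the f-string interpolation produces implicitly:
print(f"column_name IN UNNEST({example_vars})")

# The same clause built explicitly, with validation of each name:
clause = f"column_name IN UNNEST({to_sql_string_array(example_vars)})"
print(clause)
```

Validating the names first avoids generating broken SQL if a value ever contains a quote character.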

In [14]:
q = f"""

DECLARE enrich_columns STRING;
SET enrich_columns = (
WITH selected_columns as (
    SELECT column_name 
    FROM do-team.giulia_carto.INFORMATION_SCHEMA.COLUMNS
    WHERE 
    table_name = '{table_name_do}' 
    AND column_name IN UNNEST({selected_enrich_vars})
)
SELECT STRING_AGG(column_name) AS columns FROM selected_columns
);

EXECUTE IMMEDIATE format('''
CREATE TEMP TABLE enrich_table AS(
   SELECT %s, geoid
   FROM `do-team.giulia_carto.{table_name_do}`
)''',enrich_columns);

CALL `carto-un`.data.ENRICH_GRID(
   'quadkey',
   R'''
   SELECT *
   FROM enrich_table
   ''',
   'geoid',
   R'''
   SELECT geom, total_assessment, total_taxes 
   FROM `{logger}`
   ''',
   'geom',
   [('total_assessment', 'avg'), ('total_taxes', 'avg')],
   ['`{logger}_enriched`']
);

CREATE OR REPLACE TABLE `{logger}_enriched` AS(
SELECT a.*, b.geom
FROM  `{logger}_enriched` a
JOIN `{table_name_do_geom}` b
ON a.geoid = b.geoid
WHERE total_assessment_avg IS NOT NULL
);

DROP TABLE enrich_table;
"""
client_bq.query(q).result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f2bc04c88d0>

## 2. Use CARTO Analytics Toolbox to find the hotspots for the average property taxes per population

### 2a. Compute the average property taxes per population

In [15]:
q = f"""
CREATE OR REPLACE TABLE `{logger}_enriched` AS(
    SELECT *, total_taxes_avg/population as total_taxes_avg_norm
    FROM  `{logger}_enriched`
);
"""
client_bq.query(q).result()
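The normalization above is a plain ratio, so cells with zero population need some care: in BigQuery, dividing by zero with `/` raises an error, while `SAFE_DIVIDE` returns `NULL` instead. The pandas sketch below reproduces the same ratio on made-up values and masks zero-population cells; how such cells should be treated in the analysis (dropped, kept as missing, and so on) is a modelling choice that the original query does not address.

```python
import pandas as pd

# Made-up enriched cells: average tax and population per grid cell.
cells = pd.DataFrame({
    "geoid": ["cell_a", "cell_b", "cell_c"],
    "total_taxes_avg": [2000.0, 5000.0, 1500.0],
    "population": [400.0, 0.0, 50.0],
})

# Replace non-positive populations with NaN so the ratio is undefined there
# (the pandas analogue of SAFE_DIVIDE returning NULL).
population = cells["population"].where(cells["population"] > 0)
cells["total_taxes_avg_norm"] = cells["total_taxes_avg"] / population
print(cells)
```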

### 2b. Compute the Getis-Ord Gi* statistics

In this step we will use the function GETIS_ORD_QUADKEY to compute the Getis-Ord Gi* statistic for each grid cell.

**statistics.GETIS_ORD_QUADKEY**(input, size, kernel)

- input: input data with the indexes and values of the cells.
- size: size of the quadkey kring (distance from the origin). This defines the area around each index cell that will be taken into account to compute its Gi* statistic.
- kernel: kernel function to compute the spatial weights across the kring. Available functions are: uniform, triangular, quadratic, quartic and gaussian.
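To build intuition for the `size` and `kernel` parameters, the sketch below enumerates the cells of a square kring and assigns each one a weight using textbook, unnormalized forms of the five kernels. The kernel formulas, the use of the ring number (Chebyshev distance) as the distance, and the bandwidth of `size + 1` are assumptions made for illustration; the exact definitions used by GETIS_ORD_QUADKEY may differ.

```python
import math

# Textbook kernel shapes as a function of the scaled distance u in [0, 1).
KERNELS = {
    "uniform": lambda u: 1.0,
    "triangular": lambda u: 1.0 - abs(u),
    "quadratic": lambda u: 1.0 - u ** 2,
    "quartic": lambda u: (1.0 - u ** 2) ** 2,
    "gaussian": lambda u: math.exp(-0.5 * u ** 2),
}

def kring_weights(size, kernel):
    """Map each (dx, dy) offset within `size` rings of a cell to its weight."""
    kernel_fn = KERNELS[kernel]
    bandwidth = size + 1  # assumption: keeps the outermost ring above zero weight
    weights = {}
    for dx in range(-size, size + 1):
        for dy in range(-size, size + 1):
            ring = max(abs(dx), abs(dy))  # ring number of the neighbour
            weights[(dx, dy)] = kernel_fn(ring / bandwidth)
    return weights

gaussian_weights = kring_weights(3, "gaussian")
print(len(gaussian_weights))          # number of cells in the neighbourhood
print(gaussian_weights[(0, 0)])       # weight of the centre cell
print(gaussian_weights[(3, 0)])       # weight of a cell in the outermost ring
```

With `size = 3` on a square grid, the neighbourhood of each cell is a 7 x 7 block. All kernels except `uniform` give nearby cells more influence than distant ones.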

In [16]:
q = f"""
CREATE OR REPLACE TABLE `{logger}_enriched` AS(
    WITH tmp AS(
        SELECT getis.index as geoid, getis.gi as gi
        FROM (
            SELECT ARRAY_AGG(STRUCT(geoid, total_taxes_avg_norm)) AS array_data
            FROM `{logger}_enriched`
        ) input_data,
        UNNEST(`carto-un`.statistics.GETIS_ORD_QUADKEY(array_data, 3, 'gaussian')) AS getis 
    )
    SELECT b.*, a.gi
    FROM tmp a
    JOIN `{logger}_enriched` b
    ON a.geoid = b.geoid
)

"""
client_bq.query(q).result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f2bc04ec310>
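Because Gi* behaves like a z-score, a common way to read the `gi` column is to compare it with a critical value of the standard normal distribution. The sketch below labels cells using 1.96, the two-sided 5% threshold; the function name, the threshold, and the labels are illustrative choices, and the sketch applies no correction for multiple testing.

```python
import numpy as np

def classify_gi(gi, z_crit=1.96):
    """Label each Gi* value as a hot spot, a cold spot, or not significant."""
    gi = np.asarray(gi, dtype=float)
    labels = np.full(gi.shape, "not significant", dtype=object)
    labels[gi >= z_crit] = "hot spot"
    labels[gi <= -z_crit] = "cold spot"
    return labels

# Made-up Gi* values for three cells.
example_labels = classify_gi([2.5, 0.3, -2.1])
print(list(example_labels))
```

The same rule can be used to style the map in the next step, for example by colouring only the cells whose absolute Gi* exceeds the chosen threshold.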

## 3. Visualize the results

Go to https://gcp-us-east1.app.carto.com/ and create a map with the results!