# Parking in Seattle

Driving in Seattle is quickly becoming very similar to driving in San Francisco, Silicon Valley, or Los Angeles: more and more companies choose to settle or open offices in Seattle so they can tap into its tech community. As a result, parking in Seattle is getting harder by the day.

The Paid Parking Occupancy dataset, provided by the City of Seattle Department of Transportation, offers a view into around 300 million parking transactions annually from around 12 thousand parking spots on roughly 1,500 block faces. The dataset does not include any transactions for Sundays, as there is no paid parking on that day. Most of the parking spots have a 2-hour limit.

## Load the modules

First, we'll load the necessary modules and instantiate the `BlazingContext()`.

In [None]:
from blazingsql import BlazingContext
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

import cudf
import pandas as pd
import json

pd.options.display.max_rows = 100

cluster = LocalCUDACluster()
client = Client(cluster)

bc = BlazingContext(dask_client=client)

# Register S3 bucket
Next, since we are reading from S3, we need to register the bucket that holds the data.

In [None]:
_ = bc.s3(
    'bsql'
    , bucket_name = 'bsql'
)

# Create the tables

Now that we have registered the S3 bucket, we can create our tables. First, specify the locations.

In [None]:
transactions_partitions_cnt = 40
transactions_path = 's3://bsql/data/seattle_parking/parking_MayJun2019.parquet/partition_idx={partition}/'
transactions_parq = [transactions_path.format(partition=p) for p in range(transactions_partitions_cnt)]

locations_parq = 's3://bsql/data/seattle_parking/parking_locations.parquet/'

The table below maps the day-of-the-week integer to its string representation.

In [None]:
dow = cudf.DataFrame([
          (0, 'Monday')
        , (1, 'Tuesday')
        , (2, 'Wednesday')
        , (3, 'Thursday')
        , (4, 'Friday')
        , (5, 'Saturday')
        , (6, 'Sunday')
    ], columns=['dow', 'dow_str'])
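
The mapping above starts the week on Monday (0 = Monday). As a small sanity check, here is a sketch in plain Python. It assumes that the `dow` column in the dataset follows the same convention as Python's `date.weekday()`, which the 0 = Monday encoding suggests but which this notebook does not verify. `DOW_NAMES` and `dow_name` are hypothetical helpers introduced only for this illustration.

```python
import datetime

# Same order as the `dow` table above: index 0 is Monday, index 6 is Sunday.
DOW_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']


def dow_name(date):
    """Return the day name for a date, using the 0 = Monday convention."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return DOW_NAMES[date.weekday()]


# May 6, 2019 falls inside the May-June 2019 window covered by the data.
print(datetime.date(2019, 5, 6).weekday(), dow_name(datetime.date(2019, 5, 6)))
```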

Finally, create the tables.

In [None]:
%%time
bc.create_table('parking_transactions', transactions_parq)
bc.create_table('parking_locations', locations_parq)
bc.create_table('dow', dow)

# Basic information
Let's build some basic understanding of the data we're dealing with.

## Parking transactions
Let's look up the first 10 parking transactions.

In [None]:
%%time
transactions_sample = bc.sql('SELECT * FROM parking_transactions LIMIT 10')

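The first 10 transactions are now in `transactions_sample`. Because we passed a Dask client to `BlazingContext`, query results come back as distributed `dask_cudf` DataFrames, which is why we call `.compute()` whenever we want to see the actual rows.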

Let's look at the features of this dataset.

In [None]:
%%time
print(f'The dataset has {bc.sql("SELECT COUNT(*) FROM parking_transactions").compute().to_pandas().values.tolist()[0][0]:,} records and {transactions_sample.shape[1]} columns.')

In [None]:
print(f'List of columns: {", ".join(list(transactions_sample.columns))}.')

In [None]:
transactions_sample.dtypes

Most of the columns are self-explanatory: 

1. `OccupancyDateTime` describes when the transaction took place.
2. `PaidOccupancy` indicates the total number of spots occupied at that point in time.
3. `SourceElementKey` is the ID of the parking spot.
4. `dow` is the integer representation of the day-of-week (0 = Monday).

Here are the sample rows.

In [None]:
transactions_sample.compute()

## Parking locations
The next table holds the list of all the parking locations with their metadata.

In [None]:
%%time
locations_sample = bc.sql('SELECT * FROM parking_locations LIMIT 10')

Let's look at the metadata here.

In [None]:
print(f'The dataset has {bc.sql("SELECT COUNT(*) FROM parking_locations").compute().to_pandas().values.tolist()[0][0]:,} records and {locations_sample.shape[1]} columns.')

In [None]:
print(f'List of columns: {", ".join(list(locations_sample.columns))}.')

In [None]:
locations_sample.dtypes

We have 9 columns:

1. `SourceElementKey` is the ID of the parking spot. We will use it to join with the `parking_transactions` table
2. `BlockfaceName` describes the location of the parking spot in terms of blocks (see below for an example)
3. `SideOfStreet` indicates which side of the street the parking is on (e.g. north or south for a street that runs from east to west)
4. `ParkingTimeLimitCategory` shows the maximum allowed parking time (in minutes) at the location
5. `ParkingSpaceCount` gives the total number of parking spots available at the location
6. `PaidParkingArea` describes the broader parking area name
7. `PaidParkingSubArea` can be better understood as a city-quarter (e.g. Belltown, or Pioneer Square)
8. `ParkingCategory` is one of Carpool Parking, Paid Parking or RPZ (Restricted Parking Zone)
9. `Location` is a point location in WKT (Well-Known Text) format (see an example below).

In [None]:
locations_sample.compute()

Next, clean up the duplicates so that each `SourceElementKey` appears only once.

In [None]:
bc.create_table('parking_locations', bc.sql('SELECT * FROM parking_locations').drop_duplicates(subset=['SourceElementKey']))

# Featurize parking transactions
Since we'll be looking at the parking occupancy per hour of the day, per day of the week, let's extract the date features.

In [None]:
bc.create_table('parking_transactions'
    , bc.sql('''
        SELECT *
            , YEAR(OccupancyDateTime) AS o_year 
            , MONTH(OccupancyDateTime) AS o_month
            , DAYOFMONTH(OccupancyDateTime) AS o_day
            , HOUR(OccupancyDateTime) AS o_hour
        FROM parking_transactions
    ''')
)
bc.sql('SELECT * FROM parking_transactions LIMIT 10').compute()

Let's see how many transactions we get per day.

In [None]:
%%time
counts = bc.sql('''
    SELECT o_year
        , o_month
        , o_day
        , COUNT(*) AS cnt
    FROM parking_transactions
    GROUP BY o_year
        , o_month
        , o_day
    ORDER BY o_year
        , o_month
        , o_day
''')
counts.compute().to_pandas().set_index(['o_year', 'o_month', 'o_day']).plot(kind='bar', figsize=(18,9))

As you can see, the daily number of transactions is nearly constant.

In [None]:
print('Average number of transactions per day: {0:,.0f}'.format(counts['cnt'].mean().compute()))

# Featurize parking locations
Let's now extract the latitude and longitude from the parking `Location` metadata.

In [None]:
bc.create_table('parking_locations', 
    bc.sql('''
        SELECT *
            , CAST(SUBSTRING(Location, 8, delimiter_location - 10) AS FLOAT) AS LON
            , CAST(SUBSTRING(Location, delimiter_location, A.len - delimiter_location) AS FLOAT) AS LAT
        FROM (
            SELECT *
                , CHAR_LENGTH(Location) AS len
                , CASE 
                    WHEN SUBSTRING(Location, 19, 1) = ' ' THEN 20 
                    WHEN SUBSTRING(Location, 20, 1) = ' ' THEN 21 
                    WHEN SUBSTRING(Location, 21, 1) = ' ' THEN 22
                    WHEN SUBSTRING(Location, 22, 1) = ' ' THEN 23
                  END AS delimiter_location
            FROM parking_locations 
        ) AS A
    ''')
)
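
The query above slices the WKT string by character position. To make the intent easier to follow, here is a minimal sketch of the same extraction in plain Python. It assumes the `Location` values look like `POINT (<longitude> <latitude>)`; `parse_wkt_point` is a hypothetical helper, and the sample string is built from the Space Needle coordinates used later in this notebook rather than taken from the dataset.

```python
import re

# Matches 'POINT (<lon> <lat>)' with optional sign and decimal part.
_POINT_RE = re.compile(
    r'^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$'
)


def parse_wkt_point(wkt):
    """Return (lon, lat) as floats from a WKT point string."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f'Not a WKT point: {wkt!r}')
    lon, lat = (float(group) for group in match.groups())
    return lon, lat


# Hypothetical sample value (WKT lists longitude first, then latitude).
lon, lat = parse_wkt_point('POINT (-122.349358 47.620422)')
print(lon, lat)
```

Unlike the fixed-offset `CASE` expression, a pattern like this does not depend on how many digits the longitude has.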

# Average occupancy
The occupancy ratio (`PaidOccupancy` divided by `ParkingSpaceCount`, stored as `AvgOccupancy`) can be higher than 100%: I think it's a data fluke. Thus, we cap it at 100% in the query below.

In [None]:
%%time
bc.create_table('parking_transactions'
    , bc.sql('''
        SELECT SourceElementKey
            , OccupancyDateTime
            , PaidOccupancy
            , ParkingSpaceCount
            , CASE WHEN AvgOccupancy > 1 THEN 1 ELSE AvgOccupancy END AS AvgOccupancy
            , dow
            , o_hour
        FROM (
            SELECT A.*
                , B.ParkingSpaceCount
                , A.PaidOccupancy / CAST(B.ParkingSpaceCount AS FLOAT) AS AvgOccupancy
            FROM parking_transactions AS A
            LEFT OUTER JOIN (SELECT SourceElementKey, ParkingSpaceCount FROM parking_locations) AS B
                ON A.SourceElementKey = B.SourceElementKey
        ) AS inner_table
    ''')
)

bc.sql('SELECT * FROM parking_transactions LIMIT 10').compute()
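
The capping logic in the query above can be sketched for a single row in plain Python. This is only an illustration under stated assumptions: `occupancy_ratio` is a hypothetical helper, and returning `None` for a missing or zero space count is a choice made here to mirror the `NULL` a non-matching left join would produce, not something the query defines explicitly.

```python
def occupancy_ratio(paid_occupancy, space_count):
    """Return paid_occupancy / space_count, capped at 1.0 (i.e. 100%)."""
    # No matching location (None) or a zero count: no meaningful ratio.
    if not space_count:
        return None
    return min(paid_occupancy / space_count, 1.0)


print(occupancy_ratio(3, 6))     # half of the spots are occupied
print(occupancy_ratio(8, 6))     # more paid sessions than spots: capped
print(occupancy_ratio(2, None))  # location metadata is missing
```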

In [None]:
%%time
bc.create_table('means', bc.sql('''
    SELECT SourceElementKey
        , dow
        , o_hour
        , AVG(AvgOccupancy) AS MeanOccupancy
    FROM parking_transactions
    GROUP BY SourceElementKey
        , dow
        , o_hour
'''))

## Average per day-of-week and per hour

Let's look at the average occupancy per day of the week and per hour of the day.

In [None]:
%%time
mean_occupancy = bc.sql('''
    SELECT A.dow
        , B.dow_str
        , A.o_hour
        , AVG(A.AvgOccupancy) AS MeanOccupancy
    FROM parking_transactions AS A
    LEFT OUTER JOIN dow AS B
        ON A.dow = B.dow
    GROUP BY A.dow
        , B.dow_str
        , A.o_hour
    ORDER BY A.dow
        , A.o_hour
''')

In [None]:
mean_occupancy.compute().to_pandas().set_index(['dow_str', 'o_hour'])['MeanOccupancy'].plot(kind='bar', figsize=(18,9))

You can clearly see the daily seasonality and the effect of Friday night. **NOTE** Sunday is not present here, as parking in Seattle is free on Sundays.

# Find the best parking
Let's now consider a use case: you want to visit the Space Needle in Seattle, which has an iconic view of Downtown and of Puget Sound.

In [None]:
%%time
bc.create_table('parking_locations'
    , bc.sql('''
        SELECT *
            , 47.620422 AS LAT_Ref
            , -122.349358 AS LON_Ref
        FROM parking_locations
    ''')
)

bc.sql('SELECT * FROM parking_locations LIMIT 5').compute()

First, we'll calculate the haversine distance from the Space Needle (the `LAT_Ref` and `LON_Ref` coordinates above) to every parking location in our dataset.

In [None]:
bc.create_table('temp', bc.sql('''
    SELECT SourceElementKey
        , LON
        , LAT
        , LON_Ref
        , LAT_Ref
        , LAT / 180.0 * 3.141592653589 AS LAT_RAD
        , LAT_Ref / 180.0 * 3.141592653589 AS LAT_REF_RAD
        , (LON_Ref - LON) / 180.0 * 3.141592653589 AS DELTA_LON
        , (LAT_Ref - LAT) / 180.0 * 3.141592653589 AS DELTA_LAT
    FROM parking_locations
'''))

bc.create_table('temp', bc.sql('''
        SELECT *
            , POWER(SIN(DELTA_LAT / 2.0),2) + COS(LAT_RAD) * COS(LAT_REF_RAD) * POWER(SIN(DELTA_LON / 2.0),2) AS A
        FROM temp
    ''')
)

bc.create_table('parking_locations', 
    bc.sql('''
        SELECT A.*
            , ASIN(SQRT(B.A)) * 2 * 3958.8 * 5280 AS DISTANCE_FEET
        FROM parking_locations AS A
        LEFT OUTER JOIN temp AS B
            ON A.SourceElementKey = B.SourceElementKey
    ''')
)

In [None]:
bc.drop_table('temp')
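
The three SQL steps above implement the haversine formula: with latitudes and coordinate differences in radians, `a = sin²(Δlat / 2) + cos(lat) · cos(lat_ref) · sin²(Δlon / 2)` and `distance = 2 · R · asin(√a)`. For cross-checking individual rows, here is a sketch of the same computation in plain Python. It reuses the Earth radius (3958.8 miles) and the miles-to-feet factor (5280) from the query; `haversine_feet` and the constant names are hypothetical and only serve this illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # same radius as in the SQL above
FEET_PER_MILE = 5280


def haversine_feet(lat, lon, lat_ref, lon_ref):
    """Great-circle distance in feet between two (lat, lon) points in degrees."""
    lat_rad = math.radians(lat)
    lat_ref_rad = math.radians(lat_ref)
    delta_lat = math.radians(lat_ref - lat)
    delta_lon = math.radians(lon_ref - lon)
    a = (math.sin(delta_lat / 2.0) ** 2
         + math.cos(lat_rad) * math.cos(lat_ref_rad)
         * math.sin(delta_lon / 2.0) ** 2)
    return 2 * math.asin(math.sqrt(a)) * EARTH_RADIUS_MILES * FEET_PER_MILE


# Distance from the reference point to itself is zero.
print(haversine_feet(47.620422, -122.349358, 47.620422, -122.349358))
```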

Further, suppose you want to visit on Friday around 5 PM (`day_of_week = 4`, since 0 = Monday). Here's a list of the parking locations that are nearest to the Space Needle and give you the best chance of actually finding a parking spot.

In [None]:
%%time
day_of_week = 4
hour_of_day = 17

bc.sql(f'''
    SELECT BlockfaceName
        , PaidParkingArea
        , ParkingCategory
        , {day_of_week} AS day_of_week
        , {hour_of_day} AS hour_of_day
        , LON
        , LAT
        , DISTANCE_FEET
        , B.MeanOccupancy
    FROM parking_locations AS A
    LEFT OUTER JOIN means AS B
        ON A.SourceElementKey = B.SourceElementKey
            AND B.dow = {day_of_week}
            AND B.o_hour = {hour_of_day}
    WHERE DISTANCE_FEET < 1000
        AND B.MeanOccupancy <= 0.5
    ORDER BY DISTANCE_FEET ASC
''').compute()

So, the nearest parking locations are mostly in Belltown: within 1,000 ft you can find 6 locations where, on average, at least half of the parking spots are open.