<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4D Analytics using the New York City Taxi dataset --Geospatial & Visualizations</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>
Businesses that operate across multiple locations and move things between them can benefit from geospatial analysis. Typical questions concern the capacity available at each location and the capacity to move things from where they are to where they are needed. A common optimization goal is keeping the assets used for that movement in constant use. A major railroad saved millions of dollars using Vantage to reduce the idle time of engines waiting in the rail yard. A commonly used dataset for this type of analysis is the NYC Taxi data, which records trips by cab (medallion) with pickup and drop-off times, geospatial coordinates, passenger count and fares. Using geospatial analysis, decisions can be made about deploying cabs and vehicles with different capacities.<br>Vantage supports geospatial shapes (points, lines, curves, polygons, etc.) and methods for analyzing the relationships between those shapes (distance, length, overlaps, contains, touches, etc.). In this demo, we'll be using:</p>
    <ul style = 'font-size:16px;font-family:Arial'>
<li>Geospatial analysis using the ST_GEOMETRY data type and the ST_SphericalDistance method</li>
        <li>Visualizations using the Python plotly module</li>
               </ul>
    </p>
    

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>1. Import python packages, connect to Vantage and explore the dataset</b></p>

In [None]:
import getpass
import warnings

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

from teradataml.dataframe.dataframe import DataFrame
from teradataml.analytics.sqle import NGramSplitter
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context, get_context
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.options.display import display

from teradatasqlalchemy.types import *

import matplotlib.pyplot as plt

%matplotlib inline

warnings.filterwarnings('ignore')


<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell.</p>

<p style = 'font-size:16px;font-family:Arial'>The command below connects to the Vantage environment.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=NYC-taxi-geospatial-visual.ipynb;' UPDATE FOR SESSION; ''')


<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses a temporal table, we will create the databases and tables in local storage and use them in the notebook. Please execute the procedure in the next cell.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_NYCTaxi_cloud');"
# Takes about 25 seconds, estimated space: 0 MB
#%run -i ../run_procedure.py "call get_data('DEMO_NYCTaxi_local');"
# Takes about 1 minute 20 seconds, estimated space: 70 MB

<p style = 'font-size:16px;font-family:Arial'>The next step is optional: run it if you want to see the status of the databases and tables created and the space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b> Access data in Vantage  </b> </p>
<p style = 'font-size:16px;font-family:Arial'>Let us check the data sample. This demonstration will use two tables: the taxi trip details and the fares for each trip. The queries below will sample each table and then show the range of the time period covered by the data. </p>

In [None]:
qry = ''' 
SELECT top 10 * from DEMO_NYCTaxi.Trip;
'''

res = pd.read_sql(qry, eng)

res.head()

In [None]:
qry = '''
SELECT top 10 * from DEMO_NYCTaxi.Trip_Fare;
'''

res = pd.read_sql(qry, eng)

res.head()

In [None]:
qry = ''' 
sel min(pickup_datetime), max(dropoff_datetime) from DEMO_NYCTaxi.Trip;
'''

res = pd.read_sql(qry, eng)

res.head()

   <p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>2. Geospatial Analysis </b></p>
   <p style = 'font-size:16px;font-family:Arial'> Now that we have seen the trip and fare details, let's define a few landmarks. </p>

In [None]:
qry = ''' 
CREATE VOLATILE TABLE dim_geo_locations
     (
      location VARCHAR(100),
      Lat FLOAT,
      Lon FLOAT,
      geo_point SYSUDTLIB.ST_GEOMETRY(16776192) INLINE LENGTH 9920)
PRIMARY INDEX ( location )
ON COMMIT PRESERVE ROWS;
'''

eng.execute(qry)


In [None]:
qry = ''' 
insert into dim_geo_locations values('Columbia University',40.81,-73.96,'POINT(40.81 -73.96)')
;insert into dim_geo_locations values('Empire State Building',40.75,-73.99,'POINT(40.75 -73.99)')
;insert into dim_geo_locations values('Grand Central Station',40.75,-73.98,'POINT(40.75 -73.98)')
;insert into dim_geo_locations values('JFK Airport',40.64,-73.79,'POINT(40.64 -73.79)')
;insert into dim_geo_locations values('Madison Square Garden',40.75,-73.99,'POINT(40.75 -73.99)')
;insert into dim_geo_locations values('New York Stock Exchange',40.71,-74.01,'POINT(40.71 -74.01)')
;insert into dim_geo_locations values('Times Square',40.76,-73.99,'POINT(40.76 -73.99)')
;insert into dim_geo_locations values('United Nations HQ',40.75,-73.97,'POINT(40.75 -73.97)')
;insert into dim_geo_locations values('Yankee Stadium',40.83,-73.93,'POINT(40.83 -73.93)');
'''

eng.execute(qry)

In [None]:
qry = ''' 
sel * from dim_geo_locations;
'''

res = pd.read_sql(qry, eng)

res.head()

<p style = 'font-size:16px;font-family:Arial'> As you can see, dim_geo_locations contains separate latitude and longitude columns and a geospatial column holding a POINT defined in "Well Known Text" (WKT). Each supported shape type also has its own user-defined type. The available types are:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>ST_CircularString</li>
    <li>ST_GeomCollection</li>
    <li>ST_MultiLineString</li>
    <li>ST_Point</li>
    <li>ST_CompoundCurve</li>
    <li>ST_Geometry</li>
    <li>ST_MultiPoint</li>
    <li>ST_Polygon</li>
    <li>ST_Curve</li>
    <li>ST_LineString</li>
    <li>ST_MultiPolygon</li>
    <li>ST_Surface</li>
    <li>ST_CurvePolygon</li>
    <li>ST_MultiCurve</li>
    <li>ST_MultiSurface</li></ul>


<p style = 'font-size:16px;font-family:Arial'> Now let's plot these locations on the map.

In [None]:
#Dim geo locations
geo = pd.read_sql("select location,lat,lon from dim_geo_locations ;",eng)
fig1 = px.scatter_mapbox(geo, lat="Lat", lon="Lon", hover_name="location", 
                        color_discrete_sequence=["red"], zoom=11, height=400)
fig1.update_layout(mapbox_style="open-street-map")
fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig1.show()

<p style = 'font-size:16px;font-family:Arial'>
Source data is not always built with geospatial data types; it may instead store latitude and longitude in separate columns, as in the Trip table above. Ideally the data would be stored with geospatial data types, but we can cast the coordinates to a geometry data type. To simplify the code, we will create a function that builds a WKT representation of the point and casts it to ST_GEOMETRY.

In [None]:
qry = ''' 
REPLACE FUNCTION Make_Geometry(latitude FLOAT, longitude FLOAT) RETURNS ST_GEOMETRY
LANGUAGE SQL CONTAINS SQL COLLATION INVOKER INLINE TYPE 1
RETURN CAST('POINT('|| TRIM(latitude (DECIMAL(15,6))) || ' ' || TRIM(longitude (DECIMAL(15,6))) || ')' AS ST_GEOMETRY);
'''

eng.execute(qry)

 
 <p style = 'font-size:16px;font-family:Arial'>
    Here is the point built from the coordinates of JFK Airport:

In [None]:
qry = '''
SELECT Make_Geometry(40.64, -73.79);
'''
res = pd.read_sql(qry, eng)

res.head()
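<p style = 'font-size:16px;font-family:Arial'>To make the string that Make_Geometry assembles easier to see, here is a minimal Python sketch of the same idea. The helper name <code>make_point_wkt</code> is hypothetical and not part of teradataml; the six decimal places are meant to mirror the DECIMAL(15,6) format in the SQL function. The latitude-then-longitude order follows this notebook; WKT itself defines a point as POINT(x y), and many tools read x as longitude, so check the order a given method expects.</p>

```python
def make_point_wkt(latitude: float, longitude: float) -> str:
    """Hypothetical helper: build 'POINT(lat lon)' text in the order this notebook uses."""
    # Six decimal places mirror the DECIMAL(15,6) format in the SQL function.
    return f"POINT({latitude:.6f} {longitude:.6f})"


# JFK Airport coordinates from dim_geo_locations.
jfk_wkt = make_point_wkt(40.64, -73.79)
print(jfk_wkt)
```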

<p style = 'font-size:16px;font-family:Arial'> Let's find the drop-off locations for taxis starting from JFK Airport.

In [None]:
qry = '''
select 
r.dropoff_latitude
,r.dropoff_longitude
,count(*)
from DEMO_NYCTaxi.Trip r
where  cast(pickup_latitude as decimal(10,2))= 40.64
and cast(pickup_longitude as decimal(10,2)) = -73.79 
group by 1,2;
'''
drop_loc = pd.read_sql(qry, eng)

drop_loc.head()
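<p style = 'font-size:16px;font-family:Arial'>The WHERE clause above matches pickups by rounding the coordinates to two decimal places, so it selects a cell about 0.01 degrees on a side (roughly 1.1 km north to south) rather than one exact point. The sketch below illustrates that behavior in plain Python. The helper names are hypothetical, and half-up rounding is an assumption; the rounding mode Vantage applies in the cast may differ.</p>

```python
from decimal import Decimal, ROUND_HALF_UP


def to_two_decimals(value: float) -> Decimal:
    """Round a coordinate to two decimal places (half-up rounding is an assumption)."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def starts_near_jfk(pickup_latitude: float, pickup_longitude: float) -> bool:
    """Hypothetical helper mirroring the WHERE clause: match the cell at (40.64, -73.79)."""
    return (
        to_two_decimals(pickup_latitude) == Decimal("40.64")
        and to_two_decimals(pickup_longitude) == Decimal("-73.79")
    )


# Made-up pickup coordinates, for illustration only.
print(starts_near_jfk(40.6413, -73.7889))  # both values round into the cell
print(starts_near_jfk(40.6513, -73.7889))  # latitude rounds to 40.65, outside the cell
```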

<p style = 'font-size:16px;font-family:Arial'> Let's see these drop points on the map.

In [None]:
fig2 = px.scatter_mapbox(drop_loc, lat="dropoff_latitude", lon="dropoff_longitude", 
                        color_discrete_sequence=["blue"], zoom=11, height=400)
fig2.update_layout(mapbox_style="open-street-map")
fig2.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig2.show()

<p style = 'font-size:16px;font-family:Arial'> Let's visualize these drop-off locations together with the landmark locations we created earlier.

In [None]:
fig3 = go.Figure(data=fig1.data + fig2.data ,layout = fig1.layout)
fig3.show()

<p style = 'font-size:16px;font-family:Arial'> Vantage has many built-in geospatial functions, for example ST_SphericalDistance, which calculates the distance in meters between two points. For the complete list of geospatial functions, see the reference link at the end of this notebook.</p>
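<p style = 'font-size:16px;font-family:Arial'>For intuition about what a spherical distance is, the sketch below computes a great-circle distance in Python with the haversine formula. This is an illustration only, not Vantage's implementation: the function name <code>haversine_m</code> and the Earth radius of 6,371,000 m are assumptions, so results may differ slightly from what the database returns.</p>

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius; Vantage may use a different value


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Landmark coordinates taken from dim_geo_locations.
jfk_to_times_square = haversine_m(40.64, -73.79, 40.76, -73.99)
print(f"JFK Airport to Times Square: {jfk_to_times_square / 1000:.1f} km")
```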

<p style = 'font-size:16px;font-family:Arial'> Now let's filter the rides starting from JFK Airport to those whose drop-off point is within 0.5 km of any landmark.

In [None]:
qry = '''
SELECT geoloc.location
,trip.dropoff_latitude
,trip.dropoff_longitude
FROM DEMO_NYCTaxi.trip trip
join dim_geo_locations geoloc
on geoloc.geo_point.ST_SphericalDistance(make_geometry(trip.dropoff_latitude, trip.dropoff_longitude)) < 500
where  cast(pickup_latitude as decimal(10,2))= 40.64
and cast(pickup_longitude as decimal(10,2)) = -73.79 
group by 1,2,3
'''
drop_loc2 = pd.read_sql(qry, eng)

drop_loc2.head()

In [None]:
fig4 = px.scatter_mapbox(drop_loc2, lat="dropoff_latitude", lon="dropoff_longitude", hover_name="location", 
                        color_discrete_sequence=["blue"], zoom=11, height=400)
fig4.update_layout(mapbox_style="open-street-map")
fig4.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig5 = go.Figure(data=fig1.data + fig4.data ,layout = fig1.layout)
fig5.show()
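<p style = 'font-size:16px;font-family:Arial'>The distance join above can also be pictured on the client side. The pandas sketch below pairs every drop-off with every landmark and keeps the pairs closer than 500 m. It is an illustration only: the two landmarks come from dim_geo_locations, the three drop-off points are made up, and the haversine helper with a 6,371,000 m Earth radius is an assumption rather than Vantage's own calculation. For real data volumes the in-database join is the better fit, since a cross join grows with the product of the two tables.</p>

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in meters; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Two landmarks from dim_geo_locations and three made-up drop-off points.
landmarks = pd.DataFrame({
    "location": ["Times Square", "Yankee Stadium"],
    "Lat": [40.76, 40.83],
    "Lon": [-73.99, -73.93],
})
dropoffs = pd.DataFrame({
    "dropoff_latitude": [40.7605, 40.8290, 40.7000],
    "dropoff_longitude": [-73.9895, -73.9310, -73.9000],
})

# Cross join every drop-off with every landmark, then keep pairs closer than 500 m.
pairs = dropoffs.merge(landmarks, how="cross")
pairs["distance_m"] = haversine_m(
    pairs["dropoff_latitude"], pairs["dropoff_longitude"], pairs["Lat"], pairs["Lon"]
)
near = pairs.loc[pairs["distance_m"] < 500, ["location", "dropoff_latitude", "dropoff_longitude"]]
print(near)
```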

<p style = 'font-size:16px;font-family:Arial'> Which landmark has the highest number of pickup points within 0.5 km?

In [None]:
qry = '''
SELECT geoloc.location
,geoloc.lat
,geoloc.lon
,count(1) pickup_cnt
FROM DEMO_NYCTaxi.trip trip
join dim_geo_locations geoloc
on geoloc.geo_point.ST_SphericalDistance(make_geometry(trip.pickup_latitude, trip.pickup_longitude))  < 500
group by 1,2,3
'''
pickup = pd.read_sql(qry, eng)

pickup.head()

In [None]:
fig6 = px.scatter_mapbox(pickup, lat="Lat", lon="Lon", hover_name="location", 
                        color="pickup_cnt",size="pickup_cnt", zoom=11, height=400)
fig6.update_layout(mapbox_style="open-street-map")
fig6.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig6.show()

<p style = 'font-size:16px;font-family:Arial'> How many pickups occur near 'JFK Airport' in each hour of the day? As the minimum and maximum datetimes showed earlier, the data covers only one day, so no date filter is needed. Note that this query uses a 1 km (1000 m) radius around the landmark.

In [None]:
qry = '''
select
$TD_TIMECODE_RANGE time_bucket_per_hr
,l.location
,count(1) pickup_cnt
from DEMO_NYCTaxi.Trip r
join dim_geo_locations l
on l.geo_point.ST_SphericalDistance(make_geometry(r.pickup_latitude, r.pickup_longitude)) < 1000
group by time(minutes(60) and l.location)
USING TIMECODE(pickup_datetime)
where  l.location='JFK Airport'
order by 2,1;
'''
hr_throughput = pd.read_sql(qry, eng)

hr_throughput.head()

In [None]:
hr_throughput.plot(kind = 'bar', legend = True, figsize = (12, 9))

<p style = 'font-size:16px;font-family:Arial'> From the chart above we can see that pickup demand at the airport is higher from 3 PM to 10 PM and very low from 2 AM to 5 AM.
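<p style = 'font-size:16px;font-family:Arial'>The hourly bucketing that GROUP BY TIME(MINUTES(60)) performs in the database can be sketched in pandas as follows. The timestamps are made up for illustration, and because the data covers a single day the hour of day is enough to identify a bucket. Filling the empty hours with zero via <code>reindex</code> is an addition made here so that quiet periods are visible; it is not something the query above is claimed to do.</p>

```python
import pandas as pd

# Made-up pickup timestamps on a single day, for illustration only.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2013-01-01 02:15:00",
        "2013-01-01 15:05:00",
        "2013-01-01 15:40:00",
        "2013-01-01 16:20:00",
    ])
})

# Count pickups per hour of day, then fill hours with no pickups with zero.
hourly = (
    trips.groupby(trips["pickup_datetime"].dt.hour)
    .size()
    .reindex(range(24), fill_value=0)
    .rename("pickup_cnt")
)
print(hourly[hourly > 0])
```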

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>3.  Clean up </b></p>
<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Worktables </b></p>

In [None]:
eng.execute('DROP TABLE dim_geo_locations;') 


In [None]:
eng.execute('DROP FUNCTION make_geometry;') 

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'> <b>Database and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_NYCTaxi');" 
# Takes about 10 seconds. Optional: skip this step if you want to use the data later.
# Note that the same database and tables are used in Usecases/NYC-taxi-4d/NYC-taxi-4d.ipynb

<p style = 'font-size:28px;font-family:Arial;color:#E37C4D'><b>4. Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>
In this demonstration we have seen that Vantage can store common geometry types such as points and linestrings in the ST_GEOMETRY data type, and that its built-in geospatial functions are simple to use. For more information on the geometry data type and functions, please refer to the link below.

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
        <li>Teradata® Geospatial Utilities User Guide: <a href = 'https://docs.teradata.com/r/Teradata-Geospatial-Utilities-User-Guide/June-2022/Teradata-Geospatial-Utilities-Overview/Welcome-to-Teradata-Tools-and-Utilities-Teradata-Geospatial-Utilities-User-Guide'>https://docs.teradata.com/r/Teradata-Geospatial-Utilities-User-Guide/June-2022/Teradata-Geospatial-Utilities-Overview/Welcome-to-Teradata-Tools-and-Utilities-Teradata-Geospatial-Utilities-User-Guide</a></ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2023 Teradata. All Rights Reserved</footer>