<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Austin Bike Share
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Bike shares are becoming a popular alternative means of transportation. Suppose you run a transportation business that serves the public through stations where riders can access your services. You must ensure that equipment is available at the stations when the public needs it. You also know that the weather dramatically affects demand. This demonstration shows how to integrate historical trip information with weather information, using Vantage geospatial and time-series capabilities, to improve your service and grow your business.
<br>
The City of Austin makes data available on more than 649k bike trips taken between 2013 and 2017.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Contents:</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Connect to Vantage</li>
    <li>Explore the data</li>
    <li>Create and Explore Temporal, Geospatial and Time index data</li>
    <li>Insights</li>
    <li>Clean up</li>
</ol>

<hr style="height:2px;border:none;background-color:#00233C;">
<h1 style = 'font-size:28px;font-family:Arial;color:#00233c'><b>1. Connect to Vantage</b></h1>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
# !pip install folium

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# import libraries
import warnings

warnings.filterwarnings("ignore")
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import numpy as np
import random as rand
import getpass
import math

from teradataml import *
from teradataml.dataframe.dataframe import DataFrame
from teradataml.dataframe.dataframe import in_schema
from teradataml.context.context import create_context, remove_context, get_context
from teradataml.dataframe.copy_to import copy_to_sql
from teradataml.options.display import display

display.max_rows = 5

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import folium
from folium import Choropleth, Circle, Marker, CircleMarker
from folium.plugins import HeatMap, MarkerCluster

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# pandas dataframe display settings
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 1000)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted to provide the password. Enter your password, press the Enter key, then <b>use the down arrow</b> to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=4D_Analytics_on_bike_sharing_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:1px;border:none;background-color:#00233C;">
<h2 style = 'font-size:20px;font-family:Arial;color:#00233c'> <b>1.1 Getting Data for This Demo</b></h2>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield somewhat faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. You can switch modes by changing which statement is commented out.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_AustinBikeShare_cloud');"
 # takes about 30 seconds, estimated space: 0 MB
%run -i ../run_procedure.py "call get_data('DEMO_AustinBikeShare_local');" 
# takes about 50 seconds, estimated space: 200 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The next step is optional: run it if you want to see the status of the databases and tables created and the space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>2. Explore the data</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a warm-up, let us look at the tables in our database DEMO_AustinBikeShare.</p>

In [None]:
query = """SELECT 
    DatabaseName,
    TableName
FROM
    DBC.Tables
WHERE
    DatabaseName = 'DEMO_AustinBikeShare'
"""
# pd.read_sql(query, eng)
tbl_df = DataFrame.from_query(query)
tbl_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can see that we have three tables in our database. The Trips table contains data on the trips taken using the bikes, the stations table has locations of the bike stations, and the weather table has details about the weather.
    <br>
    <br>
The query below shows the number of rows in each of the tables in the database.</p>

In [None]:
query = """
SELECT
(
    SELECT COUNT(*)
    FROM DEMO_AustinBikeShare.trips
) AS trips,
(
    SELECT COUNT(*)
    FROM DEMO_AustinBikeShare.stations
) AS stations,
(
    SELECT COUNT(*)
    FROM DEMO_AustinBikeShare.weather
) AS weather;
"""

# pd.read_sql(query, eng)
cnt_df = DataFrame.from_query(query)
cnt_df

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.1 Examine the trips table</b></p>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's look at the sample data in the trips table.</p>

In [None]:
query = """
SELECT
   TOP 5 *
FROM
    DEMO_AustinBikeShare.trips
;"""

# pd.read_sql(query, eng)
trips_df = DataFrame.from_query(query)
trips_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Which subscriber types take the most rides?</p>

In [None]:
query = """
SELECT 
    top 5 count(trip_id) as ride_count, subscriber_type 
FROM DEMO_AustinBikeShare.trips 
GROUP BY subscriber_type 
ORDER BY 1 desc;"""

# pd.read_sql(query, eng)
tripcount_df = DataFrame.from_query(query)
tripcount_df.sort("ride_count", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above result, we can see that <b>Walk Up</b> riders took about <b>250%</b> more rides than the second most popular subscriber type.
    <br><br>
    Which station do the most trips start from?</p>

In [None]:
query = """
SELECT
    TOP 20
    start_station_name,
    COUNT(trip_id) AS trips
FROM
    DEMO_AustinBikeShare.trips
GROUP BY 1
ORDER BY 2 DESC;
"""

df_st_trips = DataFrame.from_query(query)
df_st_trips.sort("trips", ascending=False)

In [None]:
def get_histogram(df, x, y, title, x_title, y_title, width=1200, height=500):
    fig = px.histogram(df, x=x, y=y, title=title, nbins=df.shape[0])
    fig.update_yaxes(title=y_title)
    fig.update_xaxes(title=x_title)
    fig.update_layout(
        autosize=False,
        width=width,
        height=height,
    )
    return fig

In [None]:
df_st_trips = df_st_trips.to_pandas()
get_histogram(
    df_st_trips,
    x="start_station_name",
    y="trips",
    title="Trips by station",
    x_title="start_station_name",
    y_title="trips",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We see that <b>Riverside @ S. Lamar</b> has the highest number of originating trips.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's see the average number of trips originating from a station.</p>

In [None]:
query = """
SELECT AVG(trips) as avg_trips FROM (
    SELECT
    start_station_name,
    COUNT(1) AS trips
    FROM
        DEMO_AustinBikeShare.trips
    GROUP BY 1
) AS t;
"""

# pd.read_sql(query, eng)

df_avg_trips = DataFrame.from_query(query)
df_avg_trips

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We see that the top station <b>Riverside @ S. Lamar</b> has <b>4 times more trips</b> than the average.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now let's look at the pattern of bike usage over time. </p>    

In [None]:
query = """
SELECT
    TRUNC(start_time, 'Month') AS start_Month,
    COUNT(1) AS trips
FROM
    DEMO_AustinBikeShare.trips
GROUP BY 1
;
"""

df_trips_day = DataFrame.from_query(query)
df_trips_day.sort("start_Month")

In [None]:
df_trips_day = df_trips_day.to_pandas()
get_histogram(
    df_trips_day,
    x="start_Month",
    y="trips",
    title="Trips by month",
    x_title="start_Month",
    y_title="trips",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the above chart, we observe a few things:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>There are two months where the data is nearly missing</li>
    <li>Peak usage reaches as much as 30k trips in a month</li>
    <li>March and October are the busiest and second-busiest months, respectively, across the 4 years of data.</li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Can this be related to the weather? Is the weather in March and October favorable for biking? Let's see this in the next section.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.2 Examine the weather table</b></p>    

In [None]:
tdf = DataFrame(in_schema("DEMO_AustinBikeShare", "weather"))
tdf

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The temperature data is reported hourly (the minutes and seconds are always zero). The temperature columns are in Kelvin, which few people use to decide if it is good bicycle weather, so we will create a view over the weather table to convert the temperature to Fahrenheit. We will also average the temperature for each month.</p>
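
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a minimal sketch of the conversion the view applies, the snippet below reproduces the Kelvin-to-Fahrenheit formula in plain Python and averages a few readings. The function name <code>kelvin_to_fahrenheit</code> and the sample readings are assumptions made for illustration; they are not taken from the weather table.</p>

```python
def kelvin_to_fahrenheit(kelvin: float) -> float:
    """Convert a temperature from Kelvin to Fahrenheit."""
    return (kelvin - 273.15) * 9 / 5 + 32


# Hypothetical hourly readings in Kelvin (not taken from the weather table)
sample_readings_k = [290.15, 294.26, 299.82]

# Same order as the SQL: convert each reading, average, then round to whole degrees
converted = [kelvin_to_fahrenheit(k) for k in sample_readings_k]
average_f = round(sum(converted) / len(converted))
print(average_f)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Doing the same arithmetic inside a view keeps the conversion in the database, so only the aggregated rows need to be brought back to the notebook.</p>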

In [None]:
%%capture
query = '''
REPLACE VIEW austin_weather AS
    SELECT
        TRUNC(dt, 'Month') AS dt, 
        ROUND(AVG((temp - 273.15) * 9/5 + 32) ,0) AS AveTemp,
        SUM(CASE
                WHEN weather_main in ('Rain', 'Mist') THEN 1
                ELSE 0
            END) AS Precip_hours
    FROM DEMO_AustinBikeShare.weather
    GROUP BY 1;'''

execute_sql(query)

In [None]:
query = """
SELECT * FROM austin_weather;"""

df_temp_month = DataFrame.from_query(query)
df_temp_month.sort("dt")

In [None]:
df_temp_month = df_temp_month.to_pandas()
get_histogram(
    df_temp_month,
    x="dt",
    y="AveTemp",
    title="Average Temperature by Month",
    x_title="date",
    y_title="average temperature (°F)",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If we plot the data, we find we are missing some data, but we get an idea of the typical temperature ranges.  If we look at the hours each month when precipitation occurs, we see some patterns that could also be impacting the number of trips.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can observe that for almost all of the March and October months, the temperature is around 70 degrees Fahrenheit. This is a favorable biking temperature as it is neither too cold nor too hot.</p>

In [None]:
get_histogram(
    df_temp_month,
    x="dt",
    y="Precip_hours",
    title="Precipitation Hours by Month",
    x_title="date",
    y_title="Precip_hours",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above two charts, March and October have favorable conditions for biking, which reflects in the increased bike rides.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>2.3 Geospatial data</b></p>    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The Geospatial columns have a type and one or more pairs of Latitude and Longitude. We included the Latitude and Longitude columns in the table so you could see how a simple geospatial feature (a POINT) is represented.
    <br>
For more geospatial data types supported by Teradata, please click <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Geospatial-Data-Types/Geospatial-Data/Geometry-Types'>here</a>.</p>

In [None]:
query = """
SELECT * FROM DEMO_AustinBikeShare.stations;"""

df_stations = DataFrame.from_query(query)
df_stations

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us first look at the station statuses. There are 72 stations in total: 56 active, 10 closed, 5 moved, and 1 that is active only during the Austin City Limits Music Festival (ACL only). Let's plot them on the map.</p>

In [None]:
df_stations = df_stations.to_pandas()
df_stations["status"].value_counts()

In [None]:
# create folium map object
def get_folium_map_obj(
    location=[30.27186, -97.73997], zoom_start=10.5, height=700, tiles="OpenStreetMap"
):
    return folium.Map(
        location=location, zoom_start=zoom_start, height=height, tiles=tiles
    )

In [None]:
# get color dictionary for statuses
def get_color_dict():
    return {"active": "green", "ACLonly": "gray", "closed": "red", "moved": "purple"}

In [None]:
# Creating the map
map_obj = get_folium_map_obj(zoom_start=11)
clr_dict = get_color_dict()

# Adding points to the map
mc = MarkerCluster()
for idx, row in df_stations.iterrows():
    if not math.isnan(row["longitude"]) and not math.isnan(row["latitude"]):
        cn = row["name"]
        sts = row["status"]
        clr = clr_dict[sts.replace(" ", "")]
        mc.add_child(
            Marker(
                [row["latitude"], row["longitude"]],
                color=clr,
                popup="<b> Name:" + cn + "</b> <i> Status:" + sts + "</i>",
                icon=folium.Icon(icon="bicycle", prefix="fa", color=clr),
            )
        )
# add the finished cluster to the map once, after all markers are attached
map_obj.add_child(mc)

map_obj.add_child(folium.LatLngPopup())
# view map
map_obj

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A map with many stations is easier to read when nearby stations are grouped into clusters. When the map is zoomed out, nearby stations are clustered together; as the zoom level increases, the clusters break up. A green cluster signifies fewer than 10 stations, whereas a yellow cluster indicates 10 or more.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Click on any station to view its name and status (Active, Closed, Moved, or ACL Only).</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The marker colors indicate the station status:</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Green - Active</li>
    <li>Red - Closed</li>
    <li>Purple - Moved </li>
    <li>Gray - ACL Only </li>
</ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To show or hide stations with a particular status, such as Active or Closed, use the layer control at the top left of the map below.</p>

In [None]:
# Creating the map
map_obj = get_folium_map_obj(zoom_start=12.5)
clr_dict = get_color_dict()

# define feature groups
fg_act = folium.FeatureGroup(name="Active", color="green")
fg_acl = folium.FeatureGroup(name="ACL only", color="gray")
fg_cls = folium.FeatureGroup(name="Closed", color="red")
fg_mv = folium.FeatureGroup(name="Moved", color="purple")

# map each marker color to the feature group it belongs to
fg_by_color = {"green": fg_act, "gray": fg_acl, "red": fg_cls, "purple": fg_mv}

for idx, row in df_stations.iterrows():
    if not math.isnan(row["longitude"]) and not math.isnan(row["latitude"]):
        cn = row["name"]
        sts = row["status"]
        clr = clr_dict[sts.replace(" ", "")]
        # one marker per station, added to the group that matches its status color
        folium.Marker(
            [row["latitude"], row["longitude"]],
            popup="<b> Name:" + cn + "</b> <i> Status:" + sts + "</i>",
            icon=folium.Icon(icon="bicycle", prefix="fa", color=clr),
        ).add_to(fg_by_color[clr])

# add feature groups to the map
map_obj.add_child(fg_act)
map_obj.add_child(fg_acl)
map_obj.add_child(fg_cls)
map_obj.add_child(fg_mv)

map_obj.add_child(folium.LatLngPopup())
folium.map.LayerControl("topleft", collapsed=False).add_to(map_obj)
# view map
map_obj

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Numerous geospatial functions exist, but we can demonstrate the basics by finding the distance from the main office (station_id = 1001) to other stations.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
For more geospatial functions supported by Teradata, please click <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Geospatial-Data-Types'>here</a>.</p>
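
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To give a feel for what a spherical distance is, here is a minimal haversine sketch in plain Python. It is an illustration only, not how Vantage implements <code>ST_SphericalDistance</code>: the function name <code>spherical_distance_m</code> and the Earth radius of 6,371 km are assumptions, so results may differ slightly from the database values.</p>

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius; Vantage may use a different value


def spherical_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points, via haversine."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Distance between the two map-centre coordinates used in this notebook
print(round(spherical_distance_m(30.27186, -97.73997, 30.26476, -97.74678)))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Running the calculation in the database instead avoids moving station rows to the client just to compute distances.</p>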

In [None]:
query = """
SELECT
    station.station_id, station.name, 
    station.latitude, station.longitude,
    office.latitude as ofc_lat, 
    office.longitude as ofc_lon, 
    ROUND(office.location.ST_SphericalDistance(station.location), 0) Distance_Meters
FROM DEMO_AustinBikeShare.stations station, DEMO_AustinBikeShare.stations office
WHERE office.station_id = 1001
;
"""

df_dist_frm_stn = DataFrame.from_query(query)
df_dist_frm_stn.sort("station_id")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the map below, we can visualize the distance from the main office (station_id = 1001) to the other stations. The centre point denotes the main office, and the length of each line shows the distance between the main office and a station.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To view more details, hover over a station to see its station ID, name, and distance in meters.</p>

In [None]:
# define figure
df_dist_frm_stn = df_dist_frm_stn.to_pandas()
fig = go.Figure()

# create colors list
colors = [
    "aliceblue",
    "gold",
    "goldenrod",
    "black",
    "blanchedalmond",
    "hotpink",
    "indianred",
    "indigo",
    "blue",
    "blueviolet",
    "brown",
    "burlywood",
    "cadetblue",
    "darkred",
    "darksalmon",
    "darkseagreen",
]

for _, row in df_dist_frm_stn.iterrows():
    # one trace per station: a line from the station to the main office
    fig.add_trace(
        go.Scattermapbox(
            name=str(row["station_id"]),
            mode="markers+lines",
            lat=[row["latitude"], row["ofc_lat"]],
            lon=[row["longitude"], row["ofc_lon"]],
            hoverinfo="text",
            # hover text for the station end of the line, looked up by column name
            hovertemplate=[
                "<b>Station ID:</b> "
                + str(row["station_id"])
                + "<br><i>Name</i>: "
                + str(row["name"])
                + "<br><i>Dist (m)</i>: "
                + str(row["Distance_Meters"])
            ],
            # rand.choice can pick any color in the list, including the first
            marker={"color": rand.choice(colors), "size": 10},
            opacity=0.8,
        )
    )


fig.update_layout(
    margin={"l": 0, "t": 0, "b": 0, "r": 0},
    mapbox={
        "center": {"lon": -97.73997, "lat": 30.27186},
        "style": "open-street-map",
        "zoom": 11.5,
    },
    height=600,
)
# view map
fig.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now let's see the most frequent trip routes on the map.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The code below defines a view that groups trips by start and end station, counts the trips on each route, and keeps the 10 most frequently taken routes. Trips that start and end at the same station are excluded.</p>
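
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The same group, count, and top-N logic can be sketched in plain Python. The station ID pairs below are hypothetical and exist only to illustrate the idea; the view itself runs this logic in the database over the full trips table.</p>

```python
from collections import Counter

# Hypothetical (start_station_id, end_station_id) pairs, for illustration only
sample_trips = [
    (2501, 2499), (2501, 2499), (2501, 2499),
    (2499, 2501), (2499, 2501),
    (2575, 2575),  # starts and ends at the same station, excluded below
    (2494, 2501),
]

# Count trips per route, skipping trips that start and end at the same station
route_counts = Counter(
    (start, end) for start, end in sample_trips if start != end
)

# Keep the most frequent routes (the view keeps 10; 2 keeps this example short)
top_routes = route_counts.most_common(2)
print(top_routes)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that a route is directional here: a trip from station A to station B is counted separately from a trip from B to A, as in the view.</p>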

In [None]:
%%capture
query = '''
REPLACE  VIEW trips_cnt as
SELECT TOP 10 
    t1.start_station_id, 
    t1.end_station_id, 
    count(*) as cnt1 
FROM
    DEMO_AustinBikeShare.trips AS t1
GROUP BY 1,2
HAVING t1.start_station_id <> t1.end_station_id
ORDER BY cnt1 DESC
'''

execute_sql(query)

In [None]:
query = """
select 
    t.start_station_id,
    t.end_station_id,
    st.latitude as st_lat,
    st.longitude as st_lon,
    ed.latitude as ed_lat,
    ed.longitude as ed_lon,
    st.name as st_name,
    ed.name as ed_name,
    ROUND(st.location.ST_SphericalDistance(ed.location), 0) Distance_Meters,
    cnt1 as trip_counts_bw_stns
from demo_user.trips_cnt as t
LEFT JOIN DEMO_AustinBikeShare.stations AS st ON t.start_station_id = st.station_id
LEFT JOIN DEMO_AustinBikeShare.stations AS ed ON t.end_station_id = ed.station_id

"""

df_most_freq_routes = DataFrame.from_query(query)
df_most_freq_routes.sort("trip_counts_bw_stns", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above data suggests that the most frequently taken route is from <b>5th & Bowie to 4th & Congress</b>. It also shows that most of these routes span between 600 meters and 1500 meters.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Now, let's visualise the above routes on the map.</p>

In [None]:
# Load map centred on average coordinates
df_most_freq_routes = df_most_freq_routes.to_pandas()
my_map = map_obj = get_folium_map_obj(location=[30.26476, -97.74678], zoom_start=14.5)

colors = [
    "red",
    "goldenrod",
    "black",
    "blue",
    "blueviolet",
    "green",
    "purple",
    "darkslateblue",
    "darkslategray",
    "darkslategrey",
    "darkturquoise",
    "darkviolet",
    "deeppink",
    "deepskyblue",
]

for idx, row in df_most_freq_routes.iterrows():
    points = []
    if not math.isnan(row["st_lat"]) and not math.isnan(row["st_lon"]):
        points.append([row["st_lat"], row["st_lon"]])
        points.append([row["ed_lat"], row["ed_lon"]])
        strt = row["st_name"]
        end = row["ed_name"]
        clr = rand.choice(colors)  # any color in the list, including the first
        folium.Marker(
            [row["st_lat"], row["st_lon"]],
            popup="<b> Start:" + strt + "</b> <br> <i> End:" + end + "</i>",
            icon=folium.Icon(icon="bicycle", prefix="fa", color="green"),
        ).add_to(my_map)
        folium.Marker(
            [row["ed_lat"], row["ed_lon"]],
            popup="<b> Start:" + strt + "</b> <br> <i> End:" + end + "</i>",
            icon=folium.Icon(icon="bicycle", prefix="fa", color="green"),
        ).add_to(my_map)
        folium.PolyLine(points, color=clr, opacity=0.8).add_to(my_map)

# view map
my_map

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above visualization suggests that <b>5th & Bowie and City Hall / Lavaca & 2nd</b> are the most used starting and ending stations, respectively. Even though the map shows only the top ten routes, these routes still have trip counts of more than 1500.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>3. Create and Explore Temporal, Geospatial and Time index data</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.1 Create a temporal table with weather data</b></p>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Temporal tables store and maintain information concerning time. Using temporal tables, Vantage can process statements and queries that include time-based reasoning. Temporal tables have one or two special columns which store time information:
<ul style='font-size:16px;font-family:Arial;color:#00233C'>
    <li>A transaction-time column records and maintains the period Vantage was aware of the information in the row. Vantage automatically enters and maintains the transaction-time column data and consequently tracks such information's history.</li>
    <li>A valid-time column models the real-world and stores information such as the time an insurance policy or product warranty is valid, the length of employment of an employee, or other information that is important to track and manipulate in a time-aware fashion. When you add a new row to this type of table, you use the valid-time column to specify the time period for which the row information is valid. This is the period of validity (PV) of the information in the row.</li>
</ul>
</p>

In [None]:
%%capture
query = '''
CREATE MULTISET TABLE weather_temporal (
    begin_dt      TIMESTAMP(6) NOT NULL,
    end_dt        TIMESTAMP(6) NOT NULL,
    temp          FLOAT,
    temp_min      FLOAT,
    temp_max      FLOAT,
    pressure      INTEGER,
    humidity      INTEGER,
    wind_speed    INTEGER,
    wind_deg      INTEGER,
    rain_1h       FLOAT,
    rain_3h       FLOAT,
    clouds        INTEGER,
    weather_id    INTEGER,
    weather_main  VARCHAR(50),
    weather_desc  VARCHAR(50),
    weather_icon  VARCHAR(50),
    PERIOD FOR Weather_Duration(begin_dt,end_dt) AS VALIDTIME
)
PRIMARY INDEX (weather_id);'''

execute_sql(query)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we are converting temp, temp_min, and temp_max from Kelvin to Fahrenheit while inserting the data into the weather_temporal table.</p>

In [None]:
%%capture
query = '''
INSERT INTO weather_temporal
SELECT
    dt,
    dt + INTERVAL '59' MINUTE + INTERVAL '59' SECOND,
    round( ((temp - 273.15) * 9/5 + 32 ) ,0),
    round( ((temp_min - 273.15) * 9/5 + 32 ) ,0),
    round( ((temp_max - 273.15) * 9/5 + 32 ) ,0),
    pressure,
    humidity,
    wind_speed,
    wind_deg,
    rain_1h,
    rain_3h,
    clouds,
    weather_id,
    weather_main,
    weather_desc,
    weather_icon
FROM 
    DEMO_AustinBikeShare.weather;'''
    
execute_sql(query)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Sequenced temporal queries allow the extraction of the past, current, or future sequence of states of a temporal table. A query that is sequenced in valid time spans those rows with a period of validity that overlaps the period of applicability of the query. Additional conditions can be specified on the valid-time column to further filter the rows as required.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
For more details on Sequenced Valid-Time queries supported by Teradata, please click <a href = 'https://docs.teradata.com/search/all?query=Sequenced+Valid-Time+Queries&content-lang=en-US'>here</a>.</p>
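
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The overlap test behind a sequenced valid-time query can be sketched in plain Python. This is an illustration under the assumption that periods are half-open, meaning the beginning bound is included and the ending bound is excluded. The helper name <code>periods_overlap</code> and the sample row are hypothetical.</p>

```python
from datetime import datetime, timedelta


def periods_overlap(begin_a, end_a, begin_b, end_b):
    """True when two half-open periods [begin, end) share at least one instant."""
    return begin_a < end_b and begin_b < end_a


# Hypothetical hourly row: valid from 23:00:00 to 23:59:59 on 31 March 2016
row_begin = datetime(2016, 3, 31, 23, 0, 0)
row_end = row_begin + timedelta(minutes=59, seconds=59)

# A query period that ends at 2016-03-31 00:00 does not reach this row
ends_on_31st = periods_overlap(row_begin, row_end, datetime(2016, 3, 1), datetime(2016, 3, 31))
# A query period that ends at 2016-04-01 00:00 does
ends_on_1st = periods_overlap(row_begin, row_end, datetime(2016, 3, 1), datetime(2016, 4, 1))
print(ends_on_31st, ends_on_1st)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Under this assumption, a period of applicability meant to cover a whole month should end on the first day of the following month.</p>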

In [None]:
query = """
SEQUENCED VALIDTIME SELECT * FROM weather_temporal SAMPLE 10;
"""

df_weather = DataFrame.from_query(query)
df_weather

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With temporal tables, we can now answer time-based queries efficiently. For example, was the weather favorable to biking in March and October 2016?</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The valid-time period is also known as the period of validity (PV) of the row. Valid-time columns are defined by specifying AS VALIDTIME in the column definition, and have a period data type with an element type of DATE or TIMESTAMP(n) (optionally including WITH TIME ZONE).</p>

In [None]:
# The end bound of a PERIOD is excluded, so the period ends on the 1st of the next month
query = """
SELECT
    COUNT(weather_main) AS weather_hours, weather_main
FROM (
    VALIDTIME PERIOD '(2016-03-01, 2016-04-01)'
    SELECT * FROM weather_temporal
) AS dt
GROUP BY weather_main;
"""

df_dur_wather_type_march = DataFrame.from_query(query)
df_dur_wather_type_march

In [None]:
df_dur_wather_type_march = df_dur_wather_type_march.to_pandas()
get_histogram(
    df_dur_wather_type_march,
    x="weather_main",
    y="weather_hours",
    title="Duration(in hours) of weather by weather type(for March 2016)",
    x_title="weather_main",
    y_title="weather_hours",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph suggests that March 2016 had more hours of weather favourable for biking (clear, clouds, mist), which helps explain the increased number of bike rides.</p>

In [None]:
# The end bound of a PERIOD is excluded, so the period ends on the 1st of the next month
query = """
SELECT
    COUNT(weather_main) AS weather_hours, weather_main
FROM (
        VALIDTIME PERIOD '(2016-10-01, 2016-11-01)'
        SELECT * FROM weather_temporal
    ) AS dt
GROUP BY weather_main;
"""
df_dur_wather_type_oct = DataFrame.from_query(query)
df_dur_wather_type_oct

In [None]:
df_dur_wather_type_oct = df_dur_wather_type_oct.to_pandas()
get_histogram(
    df_dur_wather_type_oct,
    x="weather_main",
    y="weather_hours",
    title="Duration(in hours) of weather by weather type(for October 2016)",
    x_title="weather_main",
    y_title="weather_hours",
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph suggests that October 2016 had three main weather conditions (clear, rain, and clouds). Two of them (clear and clouds) are favourable for biking and accounted for more of the month, hence the increased number of bike rides.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.2 Create a view for all trips with start/end stations data and a GEOSEQUENCE with start/end lat/long/time</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The code below defines a view which enhances the trip data with a Geosequence field containing the location and time for the start and end points of the trip.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As there are about 0.65 million records in the trips table and the view uses two left joins, queries against it may take a little time to execute.</p>
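
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the Geosequence text easier to picture, the sketch below assembles the same string in plain Python, following the view's concatenation order: the start and end points, then the start and end timestamps, then the trailing fields copied verbatim from the view. The function name <code>geosequence_text</code> and the sample trip are hypothetical.</p>

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the view's 'yyyy-mm-ddbhh:mi:ss' format


def geosequence_text(start_lon, start_lat, end_lon, end_lat, start_time, duration_minutes):
    """Build GEOSEQUENCE text for a two-point trip, following the view's concatenation."""
    # The view derives end_time by adding the trip duration to start_time
    end_time = start_time + timedelta(minutes=duration_minutes)
    return (
        "GEOSEQUENCE( ("
        f"{start_lon} {start_lat},{end_lon} {end_lat}"
        "), ("
        f"{start_time.strftime(TS_FORMAT)},{end_time.strftime(TS_FORMAT)}"
        "), (1,2), (0) )"
    )


# Hypothetical trip: coordinates and times are illustrative, not read from the tables
text = geosequence_text(
    -97.74678, 30.26476, -97.73997, 30.27186, datetime(2016, 3, 1, 8, 0, 0), 15
)
print(text)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the view, the finished text is cast to ST_GEOMETRY, so each trip carries its locations and times in a single geospatial value.</p>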

In [None]:
%%capture
query = '''
REPLACE VIEW trips_geo AS
SELECT
    t.bikeid,
    t.trip_ID,
    t.subscriber_type,
    t.start_station_id,
    COALESCE(t.start_station_name, st.NAME) AS start_station_name,
    t.start_time,
    st.status starting_station_status,
    t.end_station_id,
    COALESCE(t.end_station_name, ed.NAME) AS end_station_name,
    t.start_time 
        + CAST(t.duration_minutes/60 AS INTERVAL HOUR(4)) 
        + CAST(t.duration_minutes MOD 60 AS INTERVAL MINUTE(4)) AS end_time,
    ed.status AS End_station_status,
    t.duration_minutes,
    NEW ST_GEOMETRY('ST_POINT' ,st.Longitude, st.Latitude) AS start_location,
    NEW ST_GEOMETRY('ST_POINT' ,ed.Longitude, ed.Latitude) AS end_location,
    CAST('GEOSEQUENCE( ('
        || COALESCE(st.Longitude,-98.272797)
        || ' '
        || COALESCE(st.Latitude,30.578245)
        || ','
        || COALESCE(ed.longitude,-98.272797)
        || ' '
        || COALESCE(ed.latitude,30.578245)
        || '), ('
        || CAST(CAST(t.start_time AS FORMAT 'yyyy-mm-ddbhh:mi:ss') AS VARCHAR(50))
        || ','
        || CAST(CAST(end_time AS FORMAT 'yyyy-mm-ddbhh:mi:ss') AS VARCHAR(50))
        || '), ('
        || '1,2), (0) )' AS ST_GEOMETRY) AS GEOM
FROM
    DEMO_AustinBikeShare.trips AS t
    LEFT JOIN DEMO_AustinBikeShare.stations AS st ON t.start_station_id = st.station_id
    LEFT JOIN DEMO_AustinBikeShare.stations AS ed ON t.end_station_id = ed.station_id;'''

execute_sql(query)

In [None]:
query = """
SELECT TOP 5 * FROM trips_geo;"""

df_trips_geo = DataFrame.from_query(query)
df_trips_geo

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.3 Create a Time Index table of the trips to accelerate time related analysis</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Vantage supports tables with a Primary Time Index (PTI), which is designed to store and quickly look up data that arrives over time. This time-aware index distributes rows across the units of parallelism while still allowing the optimizer to build plans that go directly to the unit of parallelism holding the data for a given time constraint.<br><br>
In this case, we declare the index with hourly granularity and a baseline date earlier than any date in our data. Based on the PRIMARY TIME INDEX declaration, the database automatically creates TD_TIMECODE as the first column of the table. When we insert data, we supply the start_time column as that value.</p>
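<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of what hourly granularity relative to a baseline means, the Python sketch below counts the whole 60-minute buckets between the baseline date and a timestamp. The function name <code>hourly_bucket</code> is hypothetical, and the notebook does not describe how Vantage computes or hashes its buckets internally, so this is a conceptual model only.</p>

```python
from datetime import datetime

BASELINE = datetime(2013, 12, 20)  # zero point, from DATE '2013-12-20'
BUCKET_MINUTES = 60                # bucket width, from MINUTES(60)


def hourly_bucket(ts):
    """Count the whole 60-minute buckets between BASELINE and ts (conceptual only)."""
    elapsed_minutes = (ts - BASELINE).total_seconds() // 60
    return int(elapsed_minutes // BUCKET_MINUTES)


# Two trips that start within the same hour share a bucket
same_hour = hourly_bucket(datetime(2013, 12, 21, 8, 15)) == hourly_bucket(datetime(2013, 12, 21, 8, 59))
print(hourly_bucket(datetime(2013, 12, 21, 8, 15)), same_hour)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Under this model, a time constraint such as "trips between 08:00 and 09:00" maps to a small set of buckets, which is the intuition behind the time-directed plans described above.</p>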

In [None]:
%%capture
query = '''
CREATE MULTISET TABLE trips_geo_pti (
    bikeid                    INTEGER,
    trip_id                   BIGINT,
    subscriber_type           VARCHAR(50),
    start_station_id          INTEGER,
    start_station_name        VARCHAR(100),
    starting_station_status   VARCHAR(50),
    end_station_id            INTEGER,
    end_station_name          VARCHAR(100),
    end_time                  TIMESTAMP(6),
    end_station_status        VARCHAR(50),
    duration_minutes          INTEGER,
    geom                      SYSUDTLIB.ST_GEOMETRY(16776192) INLINE LENGTH 9920
)
PRIMARY TIME INDEX (TIMESTAMP(6), DATE '2013-12-20', MINUTES(60));'''

execute_sql(query)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We now populate the local table. This may take a minute while the data is read from cloud storage.</p>

In [None]:
%%capture
query = '''
INSERT INTO trips_geo_pti
SELECT
    start_time,
    bikeid,
    trip_id,
    subscriber_type,
    start_station_id,
    start_station_name,
    starting_station_status,
    end_station_id,
    end_station_name,
    end_time,
    End_station_status,
    duration_minutes,
    geom
FROM
    trips_geo;'''
    
execute_sql(query)

In [None]:
tdf_trips_geo_pti = DataFrame(in_schema("demo_user", "trips_geo_pti"))
tdf_trips_geo_pti

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the result above, the geom field combines the longitude and latitude of the start and end stations with the start and end timestamps of each trip. The end_time value is derived by adding duration_minutes to the trip's start time, which is stored in TD_TIMECODE.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>3.4 Augment trips data with weather data and extract geospatial information</b></p> 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Finally, we join the geosequenced trip information with the available weather data, matching each trip to the weather report whose period contains the trip's start time (TD_TIMECODE).</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
For more geospatial functions supported by Teradata, please click <a href = 'https://docs.teradata.com/r/Geospatial-Data-Types/June-2020'>here</a>.</p>
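<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The join condition <code>wt.weather_duration CONTAINS t.TD_TIMECODE</code> can be pictured with the small Python sketch below. It assumes the usual Teradata PERIOD semantics, where the beginning bound is included and the ending bound is excluded. The helper <code>period_contains</code> and all sample rows are hypothetical.</p>

```python
from datetime import datetime


def period_contains(period_start, period_end, ts):
    """True when ts falls inside [period_start, period_end)."""
    return period_start <= ts < period_end


# Made-up weather periods: (start, end, weather_main)
weather = [
    (datetime(2017, 7, 1, 8), datetime(2017, 7, 1, 9), "Clear"),
    (datetime(2017, 7, 1, 9), datetime(2017, 7, 1, 10), "Rain"),
]

# Made-up trips: (trip_id, start time)
trips = [
    (101, datetime(2017, 7, 1, 8, 40)),
    (102, datetime(2017, 7, 1, 9, 0)),
    (103, datetime(2017, 7, 1, 11, 5)),
]

# Inner join: keep only the trips whose start time falls inside a weather period
matched = [
    (trip_id, label)
    for trip_id, start in trips
    for p_start, p_end, label in weather
    if period_contains(p_start, p_end, start)
]
print(matched)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Trip 103 has no covering weather period, so it is absent from the result, just as an inner join would drop it.</p>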

In [None]:
%%capture
query = '''
CREATE MULTISET TABLE trips_and_weather AS (
    SELECT 
        t.start_station_name,
        t.end_station_name,
        t.bikeid,
        t.trip_id,
        t.subscriber_type as subscriber_type,
        t.geom.GetInitT() AS pickup_time,
        t.geom.GetFinalT() AS dropoff_time,
        t.geom.ST_POINTN(1).ST_SPHEROIDALDISTANCE(geom.ST_POINTN(2))/1000 AS total_distance,
        t.geom.ST_POINTN(1).ST_X() AS pickup_location_lon,
        t.geom.ST_POINTN(1).ST_Y() AS pickup_location_lat,
        t.geom.ST_POINTN(2).ST_X() AS dropoff_location_lon,
        t.geom.ST_POINTN(2).ST_Y() AS dropoff_location_lat,        
        t.duration_minutes,
        t.TD_TIMECODE as Trip_TIMECODE,
        wt.*
    FROM 
        trips_geo_pti AS t
        INNER JOIN Weather_temporal AS wt ON wt.weather_duration CONTAINS t.TD_TIMECODE
        AND pickup_time >= '2017-07-01 00:00:00'
)
WITH DATA primary index(trip_id);'''

execute_sql(query)

In [None]:
query = """
SELECT TOP 5 * FROM trips_and_weather WHERE CAST(pickup_time AS DATE) BETWEEN '2017-07-01' AND '2017-07-31'
"""
df_trips_weather = DataFrame.from_query(query)
df_trips_weather

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>4. Insights</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.1 Average distance traveled w.r.t start stations</b></p>   

In [None]:
query = """
SELECT
    start_station_name, AVG(total_distance) avg_tot_dist, COUNT(trip_id) cnt_trips
FROM trips_and_weather
GROUP BY start_station_name
;"""

df_avg_dist = DataFrame.from_query(query)
df_avg_dist.sort("cnt_trips", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The results above suggest that trips starting at Main Office have the highest average distance. Only ten trips originate from this station, but they are very long, so its average is still the highest.</p>
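<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The total_distance values averaged here are computed with ST_SPHEROIDALDISTANCE between the start and end station points and divided by 1000 to give kilometres. They are therefore station-to-station distances, not the path actually ridden. The Python sketch below uses the haversine formula on a sphere as a stand-in; it is not the spheroidal calculation that Vantage performs, so results can differ slightly. The function name and the Earth radius constant are assumptions of this sketch.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A round trip starts and ends at the same station, so its distance is zero
round_trip_km = haversine_km(-97.74, 30.27, -97.74, 30.27)
print(round_trip_km)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because round trips contribute a distance of zero, stations with many round trips will show a lower average distance regardless of how far riders actually went.</p>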

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.2 Effect of weather on distance travelled</b></p>   

In [None]:
query = """
SELECT
    TOP 5 SUM(total_distance) AS distance_km, subscriber_type, weather_main
FROM trips_and_weather
GROUP BY subscriber_type, weather_main
ORDER BY distance_km DESC
;"""

df_tot_dist = DataFrame.from_query(query)
df_tot_dist.sort("distance_km", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the results above, walk-up, local365, and local30 subscribers covered the most distance when the weather was clear or cloudy.</p>
<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.3 Average trip duration w.r.t subscriber type and trip type</b></p>   

In [None]:
query = """
SELECT
    subscriber_type,
    CASE
        WHEN start_station_name = end_station_name THEN 'Round_Trip'
        ELSE 'Point-to-Point'
    END AS trip_type,
    AVG(duration_minutes) AS time_mins
FROM trips_and_weather
GROUP BY subscriber_type, trip_type
;"""

df_avg_duration = DataFrame.from_query(query)
df_avg_duration.sort("time_mins", ascending=False)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the results above, round trips last longer on average than point-to-point trips for explorer, walk-up, and annual members.</p>
<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.4 Does the bike require maintenance?</b></p>   

In [None]:
query = """
SELECT
    bikeid, COUNT(*) AS num_trips, sum(total_distance) AS distance,
    CASE
        WHEN distance > 70 THEN 'Recommended'
        ELSE 'Not Required'
    END AS maintenance
FROM trips_and_weather
GROUP BY bikeid
; """

df_maintenance = DataFrame.from_query(query)
df_maintenance.sort("distance", ascending=False)

In [None]:
df_maintenance = df_maintenance.to_pandas()
get_histogram(
    df_maintenance.groupby(["maintenance"]).size().reset_index(name="counts"),
    x="maintenance",
    y="counts",
    title="Maintenance Required?",
    x_title="maintenance",
    y_title="counts",
    width=600,
    height=500,
)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Looking at the results above, 50 bikes are flagged for maintenance under our assumption that a bike should be serviced after every 70 km.</p>
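<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The maintenance rule can be restated as a short Python sketch: sum the distance per bike, then flag any bike whose total is strictly greater than the 70 km threshold. The function name <code>maintenance_flags</code> and the sample trips are hypothetical.</p>

```python
from collections import defaultdict

MAINTENANCE_KM = 70  # threshold assumed in the query above


def maintenance_flags(trips):
    """Map each bikeid to a maintenance label from (bikeid, distance_km) pairs."""
    totals = defaultdict(float)
    for bikeid, distance_km in trips:
        totals[bikeid] += distance_km
    # Strictly greater than the threshold, matching "distance > 70" in the SQL
    return {
        bikeid: "Recommended" if total > MAINTENANCE_KM else "Not Required"
        for bikeid, total in totals.items()
    }


# Made-up trips: bike 1 totals 75.5 km, bike 2 totals exactly 70 km
flags = maintenance_flags([(1, 40.0), (1, 35.5), (2, 70.0)])
print(flags)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keep in mind that trips_and_weather only holds trips from 2017-07-01 onward, so the totals reflect distance accumulated in that window rather than over a bike's lifetime.</p>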

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:28px;font-family:Arial;color:#00233c'>5. Clean up</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>5.1 Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up the work tables to prevent errors the next time the notebook runs. This section drops the objects created during the demonstration.</p>

In [None]:
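tables = ["weather_temporal", "trips_geo_pti", "trips_and_weather"]

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table, schema_name="demo_user")
    except Exception:
        pass

# trips_geo is a view, so it is dropped with DROP VIEW rather than db_drop_table
try:
    execute_sql("DROP VIEW trips_geo;")
except Exception:
    pass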

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>5.2 Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_AustinBikeShare');" 

In [None]:
remove_context()

<p style = 'font-size:16px;font-family:Arial;color:#00233c'><b>Links:</b></p>
<ul style='font-size:16px;font-family:Arial;color:#00233C'>
    <li>Information about Geospatial datatype can be found <a href = 'https://docs.teradata.com/search/all?query=geospatial&content-lang=en-US'>here</a></li>
    <li>Information about Temporal datatype can be found <a href = 'https://docs.teradata.com/search/all?query=temporal&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>