# Data Visualization

The RAPIDS AI ecosystem and `cudf.DataFrame` are built on a series of standards that simplify interoperability with established and emerging data science tools.

With a growing number of libraries adding GPU support, and a `cudf.DataFrame`’s ability to convert to pandas with `.to_pandas()`, a large portion of the Python Visualization ([PyViz](https://pyviz.org/tools.html)) stack is immediately available to display your data.

In this Notebook, we’ll walk through some of the data visualization possibilities with BlazingSQL. 

Blog post: [Data Visualization with BlazingSQL](https://blog.blazingdb.com/data-visualization-with-blazingsql-12095862eb73?source=friends_link&sk=94fc5ee25f2a3356b4a9b9a49fd0f3a1)

#### Overview 
- [Matplotlib](#Matplotlib)
- [Datashader](#Datashader)
- [HoloViews](#HoloViews)
- [cuxfilter](#cuxfilter)

In [None]:
from blazingsql import BlazingContext
bc = BlazingContext()

### Dataset

The data we’ll be using for this demo comes from the [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and is stored in a public AWS S3 bucket.

In [None]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/taxi_data.parquet')

Let's take a quick look at the data to get a sense of what we're working with.

In [None]:
bc.sql('select * from taxi').tail()

## Matplotlib 

[GitHub](https://github.com/matplotlib/matplotlib)

> _Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python._

By calling the `.to_pandas()` method, we can convert a `cudf.DataFrame` into a `pandas.DataFrame` and instantly access Matplotlib with `.plot()`.

For example, **does the `passenger_count` influence the `tip_amount`?**

In [None]:
bc.sql('SELECT * FROM taxi').to_pandas().plot(kind='scatter', x='passenger_count', y='tip_amount')

Other than the jump from 0 to 1 and the outliers at 5 and 6, carrying more passengers doesn't appear to raise the driver's `tip_amount`.

Let's see what demand is like. Based on dropoff time, **how many riders were transported by hour?** For example, bar `7` will be the total number of passengers dropped off from 7:00 AM through 7:59 AM across all days in this time period.

In [None]:
riders_by_hour = '''
                 select
                     sum(passenger_count) as sum_riders,
                     hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) as hour_of_the_day
                 from
                     taxi
                 group by
                     hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP))
                 order by
                     hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP))
                     '''
bc.sql(riders_by_hour).to_pandas().plot(kind='bar', x='hour_of_the_day', y='sum_riders', title='Sum Riders by Hour', figsize=(12, 6))

Looks like the morning gets started around 6:00 AM, and builds up to a sustained lunchtime double peak from 12:00 PM - 3:00 PM. After a quick 3:00 PM - 5:00 PM siesta, we're right back for prime time from 6:00 PM to 8:00 PM. It's downhill from there, but tomorrow is a new day!
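To make the aggregation behind this chart concrete, here is a minimal standard-library sketch of what the `riders_by_hour` query computes. The sample rows are made up, and the sketch assumes the dropoff column holds text such as `2015-01-01 07:05:00`, which would explain why the query appends `'.0'` before casting to `TIMESTAMP`. It illustrates the grouping logic only, not how BlazingSQL executes the query.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (tpep_dropoff_datetime as text, passenger_count).
sample_trips = [
    ("2015-01-01 07:05:00", 1),
    ("2015-01-01 07:45:10", 2),
    ("2015-01-02 07:59:59", 1),
    ("2015-01-01 08:00:00", 3),
    ("2015-01-01 18:30:00", 1),
]


def riders_by_hour(trips):
    """Sum passenger_count per dropoff hour, sorted by hour."""
    totals = defaultdict(int)
    for dropoff_text, passenger_count in trips:
        # Mirror the SQL: append '.0', then parse the text as a timestamp.
        dropoff_ts = datetime.strptime(dropoff_text + ".0", "%Y-%m-%d %H:%M:%S.%f")
        totals[dropoff_ts.hour] += passenger_count
    return dict(sorted(totals.items()))


hourly = riders_by_hour(sample_trips)
print(hourly)  # {7: 4, 8: 3, 18: 1}
```

Hour `7` collects trips from both days, which matches the "for all days in this time period" behavior of the bar chart.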

In [None]:
solo_rate = len(bc.sql('select * from taxi where passenger_count = 1')) / len(bc.sql('select * from taxi')) * 100

print(f'{solo_rate:.1f}% of rides have only 1 passenger.')

The overwhelming majority of rides have just 1 passenger. How consistent is this solo rider rate? **What's the average `passenger_count` per trip by hour?** 

Time of day may play a role in `tip_amount` as well: **what's the average `tip_amount` per trip by hour?**

We can run both queries in the same cell and the results will display inline.

In [None]:
xticks = list(range(24))

avg_riders_by_hour = '''
                     select
                         avg(passenger_count) as avg_passenger_count,
                         hour(dropoff_ts) as hour_of_the_day
                     from (
                         select
                             passenger_count, 
                             cast(tpep_dropoff_datetime || '.0' as TIMESTAMP) dropoff_ts
                         from
                             taxi
                             )
                     group by
                         hour(dropoff_ts)
                     order by
                         hour(dropoff_ts)
                         '''
bc.sql(avg_riders_by_hour).to_pandas().plot(kind='line', x='hour_of_the_day', y='avg_passenger_count', title='Avg. # Riders per Trip by Hour', xticks=xticks, figsize=(12, 6))

avg_tip_by_hour = '''
                  select
                      avg(tip_amount) as avg_tip_amount,
                      hour(dropoff_ts) as hour_of_the_day
                  from (
                      select
                          tip_amount, 
                          cast(tpep_dropoff_datetime || '.0' as TIMESTAMP) dropoff_ts
                      from
                          taxi
                          )
                  group by
                      hour(dropoff_ts)
                  order by
                      hour(dropoff_ts)
                      '''
bc.sql(avg_tip_by_hour).to_pandas().plot(kind='line', x='hour_of_the_day', y='avg_tip_amount', title='Avg. Tip ($) per Trip by Hour', xticks=xticks, figsize=(12, 6))

Interestingly, the two curves track each other closely from 8:00 PM to 9:00 AM, but while average `passenger_count` continues to rise until 3:00 PM, average `tip_amount` dips until 3:00 PM.

From 3:00 PM - 8:00 PM average `tip_amount` starts rising and average `passenger_count` waits patiently for it to catch up.

Average `tip_amount` peaks at midnight, and bottoms out at 5:00 AM. Average `passenger_count` is highest around 3:00 AM, and lowest at 6:00 AM.

## Datashader
    
[GitHub](https://github.com/holoviz/datashader)

> Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

As of [holoviz/datashader#793](https://github.com/holoviz/datashader/pull/793), the following Datashader features accept `cudf.DataFrame` and `dask_cudf.DataFrame` input:

- `Canvas.points`, `Canvas.line` and `Canvas.area` rasterization
- All reduction operations except `var` and `std`
- `transfer_functions.shade` (both 2D and 3D) inputs
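To give a feel for what point rasterization with a count reduction means, here is a minimal pure-Python sketch. It is a conceptual illustration under simplified assumptions, not Datashader's implementation: `rasterize_points` is a hypothetical helper, and the coordinates are invented stand-ins for `dropoff_x` and `dropoff_y`.

```python
def rasterize_points(points, x_range, y_range, width, height):
    """Count points per pixel on a width x height grid (a 'count' reduction)."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Skip points outside the viewport.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Scale to pixel indices; clamp so the max edge lands in the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Hypothetical coordinates standing in for dropoff_x / dropoff_y.
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (0.95, 0.8), (0.6, 0.1), (5.0, 5.0)]
grid = rasterize_points(points, x_range=(0.0, 1.0), y_range=(0.0, 1.0), width=2, height=2)
print(grid)  # [[2, 1], [0, 2]]
```

Each cell of the grid is then mapped to a color, which is the role `transfer_functions.shade` plays in the cells that follow.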

#### Colorcet

[GitHub](https://github.com/holoviz/colorcet)

> Colorcet is a collection of perceptually uniform colormaps for use with Python plotting programs like bokeh, matplotlib, holoviews, and datashader based on the set of perceptually uniform colormaps created by Peter Kovesi at the Center for Exploration Targeting.

In [None]:
from datashader import Canvas, transfer_functions as tf
from colorcet import fire

**Do dropoff locations change based on the time of day?** Let's say 6AM-4PM vs 6PM-4AM.
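The evening window wraps past midnight, which is why its query needs two `BETWEEN` ranges joined by `OR`, while the daytime window needs only one. A minimal sketch of that logic, using a hypothetical `in_hour_window` helper with a half-open `[start, end)` convention:

```python
def in_hour_window(hour, start, end):
    """Return True if `hour` falls in [start, end), allowing wrap past midnight."""
    if start <= end:
        return start <= hour < end
    # The window wraps: e.g. 18 -> 4 covers 18..23 and 0..3.
    return hour >= start or hour < end


day_hours = [h for h in range(24) if in_hour_window(h, 6, 16)]
night_hours = [h for h in range(24) if in_hour_window(h, 18, 4)]
print(day_hours)    # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print(night_hours)  # [0, 1, 2, 3, 18, 19, 20, 21, 22, 23]
```

Because SQL's `BETWEEN` is inclusive on both ends, `BETWEEN 6 AND 15` selects the same hours as the half-open window from 6 to 16 here.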

Dropoffs from 6:00 AM to 4:00 PM

In [None]:
query = '''
        select 
            dropoff_x, dropoff_y 
        from 
            taxi 
            where  
                hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) BETWEEN 6 AND 15
                '''
nyc = Canvas().points(bc.sql(query), 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(nyc, cmap=fire), "black")

Dropoffs from 6:00 PM to 4:00 AM

In [None]:
query = '''
        select 
            dropoff_x, dropoff_y 
        from 
            taxi 
            where  
                hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) BETWEEN 18 AND 23
                OR hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) BETWEEN 0 AND 3
                '''
nyc = Canvas().points(bc.sql(query), 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(nyc, cmap=fire), "black")

While Manhattan makes up the majority of the dropoff geography from 6:00 AM to 4:00 PM, Midtown's spark grows and spreads deeper into Brooklyn and Queens in the 6:00 PM to 4:00 AM window. 

Consistent with the more decentralized look across the map, dropoffs near LaGuardia Airport (upper-middle right side) also die down relative to surrounding areas as the night rolls in.

## HoloViews 

[GitHub](https://github.com/holoviz/holoviews)

> HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting.

By calling the `.to_pandas()` method, we can convert a `cudf.DataFrame` into a `pandas.DataFrame` and hand off to HoloViews or other CPU visualization packages.

In [None]:
from holoviews import extension, opts
from holoviews import Scatter, Dimension
import holoviews.operation.datashader as hd

extension('bokeh')
opts.defaults(opts.Scatter(height=425, width=425), opts.RGB(height=425, width=425))

cmap = [(49,130,189), (107,174,214), (123,142,216), (226,103,152), (255,0,104), (50,50,50)]

With HoloViews, we can easily explore the relationship between multiple scatter plots by saving them as variables and displaying them side by side in the same code cell.

For example, let's reexamine `passenger_count` vs `tip_amount` next to a new `holoviews.Scatter` of `fare_amount` vs `tip_amount`.

**Does `passenger_count` affect `tip_amount`?**

In [None]:
s = Scatter(bc.sql('select passenger_count, tip_amount from taxi').to_pandas(), 'passenger_count', 'tip_amount')

# 0-6 passengers, $0-$100 tip
ranged = s.redim.range(passenger_count=(-0.5, 6.5), tip_amount=(0, 100))
shaded = hd.spread(hd.datashade(ranged, x_sampling=0.25, cmap=cmap))

riders_v_tip = shaded.redim.label(passenger_count="Passenger Count", tip_amount="Tip ($)")

**How do `fare_amount` and `tip_amount` relate?**

In [None]:
s = Scatter(bc.sql('select fare_amount, tip_amount from taxi').to_pandas(), 'fare_amount', 'tip_amount')

# $0-$100 fare, $0-$100 tip
ranged = s.redim.range(fare_amount=(0, 100), tip_amount=(0, 100))
shaded = hd.spread(hd.datashade(ranged, cmap=cmap))

fare_v_tip = shaded.redim.label(fare_amount="Fare Amount ($)", tip_amount="Tip ($)")

Display the answers to both side by side.

In [None]:
riders_v_tip + fare_v_tip

## cuxfilter

[GitHub](https://github.com/rapidsai/cuxfilter)

> cuxfilter (ku-cross-filter) is a RAPIDS framework to connect web visualizations to GPU accelerated crossfiltering. Inspired by the javascript version of the original, it enables interactive and super fast multi-dimensional filtering of 100 million+ row tabular datasets via cuDF.

cuxfilter lets us combine these charts into a dashboard.
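The idea behind crossfiltering is that a selection made on one chart changes what the other charts show. The sketch below illustrates that idea in plain Python on a handful of invented rows. `crossfilter_counts` is a hypothetical helper, not part of the cuxfilter API, and cuxfilter itself performs this kind of filtering on the GPU via cuDF.

```python
from collections import Counter

# Hypothetical rows with two dimensions from the taxi table.
rows = [
    {"passenger_count": 1, "tip_amount": 0.0},
    {"passenger_count": 1, "tip_amount": 2.5},
    {"passenger_count": 2, "tip_amount": 3.0},
    {"passenger_count": 2, "tip_amount": 12.0},
    {"passenger_count": 5, "tip_amount": 1.0},
]


def crossfilter_counts(rows, group_by, filters):
    """Count rows per `group_by` value after applying every range filter.

    `filters` maps a column name to an inclusive (low, high) range.
    """
    def keep(row):
        return all(low <= row[col] <= high for col, (low, high) in filters.items())

    return Counter(row[group_by] for row in rows if keep(row))


# No selection: the bar chart counts every row.
unfiltered = crossfilter_counts(rows, "passenger_count", filters={})
# Selecting tips of $2-$5 on one chart changes the counts on the other.
filtered = crossfilter_counts(rows, "passenger_count", filters={"tip_amount": (2.0, 5.0)})
print(dict(unfiltered))  # {1: 2, 2: 2, 5: 1}
print(dict(filtered))    # {1: 1, 2: 1}
```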

In [None]:
import cuxfilter

Create a `cuxfilter.DataFrame` from a `cudf.DataFrame`.

In [None]:
cux_df = cuxfilter.DataFrame.from_dataframe(bc.sql('SELECT passenger_count, tip_amount, dropoff_x, dropoff_y FROM taxi'))

Create some charts & define a dashboard object.

In [None]:
chart_0 = cuxfilter.charts.datashader.scatter_geo(x='dropoff_x', y='dropoff_y')

chart_1 = cuxfilter.charts.bokeh.bar('passenger_count', add_interaction=False)

chart_2 = cuxfilter.charts.datashader.heatmap(x='passenger_count', y='tip_amount', x_range=[-0.5, 6.5], y_range=[0, 100], 
                                              color_palette=cmap, title='Passenger Count vs Tip Amount ($)')

In [None]:
dashboard = cux_df.dashboard([chart_0, chart_1, chart_2], title='NYC Yellow Cab')

Display individual charts in the Notebook with `.view()`.

In [None]:
chart_0.view()

In [None]:
chart_2.view()

## Multi-GPU Data Visualization

Packages like Datashader and cuxfilter support dask_cudf distributed objects (Series, DataFrame).
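One reason count-style aggregations suit distributed data is that they can be computed per partition and then summed. The sketch below shows this with plain Python on invented points. `count_by_bin` and `combine` are hypothetical helpers for illustration, and this is not a description of how Datashader or Dask schedule the work.

```python
from collections import Counter


def count_by_bin(partition, bin_size):
    """Count points per (x_bin, y_bin) within one partition."""
    return Counter((int(x // bin_size), int(y // bin_size)) for x, y in partition)


def combine(partials):
    """Merge per-partition counts by summing them."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical points split across two partitions (one per worker).
partition_a = [(0.5, 0.5), (1.5, 0.5), (0.2, 0.9)]
partition_b = [(0.7, 0.1), (1.9, 1.9)]

distributed = combine(count_by_bin(p, bin_size=1.0) for p in (partition_a, partition_b))
single = count_by_bin(partition_a + partition_b, bin_size=1.0)
print(distributed == single)  # True
```

The combined result matches what a single pass over all the points would produce, so the partitions never need to be gathered onto one device first.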

In [None]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

bc = BlazingContext(dask_client=client, network_interface='lo')

In [None]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

bc.create_table('distributed_taxi', 's3://blazingsql-colab/yellow_taxi/taxi_data.parquet')

Dropoffs from 6:00 PM to 4:00 AM

In [None]:
query = '''
        select 
            dropoff_x, dropoff_y 
        from 
            distributed_taxi 
            where  
                hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) BETWEEN 18 AND 23
                OR hour(cast(tpep_dropoff_datetime || '.0' as TIMESTAMP)) BETWEEN 0 AND 3
                '''

nyc = Canvas().points(bc.sql(query), 'dropoff_x', 'dropoff_y')

tf.set_background(tf.shade(nyc, cmap=fire), "black")

## That's the Data Visualization Tour!

You've seen the basics of data visualization in BlazingSQL Notebooks and how to put it to use. Now is a good time to experiment with your own data and see how to parse, clean, and extract meaningful insights from it.

We'll now get into how to run Machine Learning with popular Python and GPU-accelerated Python packages.

Continue to the [Machine Learning introductory Notebook](machine_learning.ipynb)