# Data Visualization

Let's go through some of the data visualization possibilities available within BlazingSQL Notebooks.

The RAPIDS ecosystem and `cudf.DataFrame` are built on a series of standards to simplify interoperability with many different tools. A `cudf.DataFrame`'s ability to easily convert to a `pandas.DataFrame` makes a large portion of the [Python Visualization (PyViz)](https://pyviz.org/overviews/index.html) stack immediately accessible. 

We've also included a few examples from the growing group of visualization libraries that are leveraging GPU-acceleration to quickly render millions or billions of points. 

## ETL - Let's Get Some Data

First, create a table to query from the [NYC Yellow Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) you might have seen in our other demos.

In [None]:
from blazingsql import BlazingContext
bc = BlazingContext()

In [None]:
bc.create_table('taxi', '../data/sample_taxi.csv', header=0)

Let's take a quick look at the data to get a feel for it.

In [None]:
bc.sql('select * from taxi').tail()

### Matplotlib 

[GitHub](https://github.com/matplotlib/matplotlib)

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

By calling the `.to_pandas()` method, we can convert a `cudf.DataFrame` into a `pandas.DataFrame` and hand off to Matplotlib or other CPU visualization packages.

Let's convert a SQL query into a `pandas.DataFrame` and plot a correlation matrix leveraging pandas functionality.

In [None]:
bc.sql('SELECT * FROM taxi').to_pandas().corr().style.background_gradient()
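
Depending on the pandas version, `corr()` may raise an error if the frame still contains non-numeric columns, such as timestamps stored as strings. Here is a minimal sketch of one workaround, assuming plain pandas and a small synthetic frame (the values are made up for illustration, not taken from the taxi data). It selects the numeric columns before computing the matrix.

```python
import pandas as pd

# Synthetic stand-in for the taxi table; the values are invented for illustration.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-01-01 00:10:00", "2015-01-01 08:30:00",
                             "2015-01-01 08:45:00", "2015-01-01 17:05:00"],
    "passenger_count": [1, 2, 1, 4],
    "trip_distance": [1.0, 2.0, 3.0, 4.0],
    "tip_amount": [0.5, 1.0, 1.5, 2.0],
})

# Keep only numeric columns so the string timestamps never reach corr().
numeric = trips.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The same `select_dtypes` step could be chained between `.to_pandas()` and `.corr()` in the cell above if the full table triggers that error.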

Does the number of riders influence the tip amount?

In [None]:
bc.sql('SELECT * FROM taxi').to_pandas().plot(kind='scatter', x='passenger_count', y='tip_amount')

How many riders are transported each hour?

In [None]:
riders_by_hour = '''
                 select
                     sum(passenger_count) as sum_riders,
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP)) as hours
                 from
                     taxi
                 group by
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP))
                 order by
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP))
                     '''
bc.sql(riders_by_hour).to_pandas().plot(kind='bar', x='hours', y='sum_riders', title='Sum Riders by Hour', figsize=(12, 6))

How many passengers per ride, on average, at each hour?

In [None]:
avg_riders_by_hour = '''
                     select
                         avg(passenger_count) as avg_riders,
                         hour(ts_pickup) as hours
                     from (
                         select
                             passenger_count, 
                             cast(tpep_pickup_datetime || '.0' as TIMESTAMP) ts_pickup
                         from
                             taxi
                             )
                     group by
                         hour(ts_pickup)
                     order by
                         hour(ts_pickup)
                         '''
bc.sql(avg_riders_by_hour).to_pandas().plot(kind='line', x='hours', y='avg_riders', title='Avg. Riders per Trip by Hour', figsize=(12, 6))
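
For comparison, the two hourly aggregations above can also be written in pandas once the data is on the CPU. This is only a sketch under assumptions: it uses a tiny synthetic frame (made-up values) with the `tpep_pickup_datetime` and `passenger_count` columns, and it is not how BlazingSQL executes the queries.

```python
import pandas as pd

# Synthetic stand-in for the taxi table; the values are invented for illustration.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-01-01 00:10:00", "2015-01-01 08:30:00",
                             "2015-01-01 08:45:00", "2015-01-01 17:05:00"],
    "passenger_count": [1, 2, 1, 4],
})

# Parse the string timestamps, then extract the hour of day (0-23).
hours = pd.to_datetime(trips["tpep_pickup_datetime"]).dt.hour.rename("hours")

# One pass gives both the total and the average riders per pickup hour.
riders_by_hour = (
    trips.groupby(hours)["passenger_count"]
    .agg(sum_riders="sum", avg_riders="mean")
    .reset_index()
)
print(riders_by_hour)
```

Running the aggregation in SQL first, as the cells above do, means only the small per-hour result is converted with `.to_pandas()` rather than the full table.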

### Datashader
    
[GitHub](https://github.com/holoviz/datashader/)

Datashader is a data rasterization pipeline that automates the creation of meaningful representations of large datasets. It can accept a `cudf.DataFrame` directly, so no conversion to pandas is needed.

Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.
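
To make the idea of rasterization concrete, here is a rough NumPy sketch of the two core steps: binning points into a fixed grid, then mapping the counts to intensities. It is an illustration of the concept only, not Datashader's implementation. The helper names `rasterize_points` and `shade_log` are hypothetical, and the log scaling is a simple stand-in for the normalization options Datashader provides.

```python
import numpy as np

def rasterize_points(x, y, width=4, height=4,
                     x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Count how many points fall into each cell of a height x width grid."""
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts

def shade_log(counts):
    """Scale counts to [0, 1] on a log scale so sparse cells stay visible."""
    scaled = np.log1p(counts)
    peak = scaled.max()
    return scaled / peak if peak > 0 else scaled

# Random points stand in for real coordinates such as dropoff_x / dropoff_y.
rng = np.random.default_rng(0)
x = rng.random(10_000)
y = rng.random(10_000)

grid = rasterize_points(x, y)   # fixed-size array, regardless of point count
image = shade_log(grid)         # intensities ready to be mapped to a colormap
print(grid.shape, int(grid.sum()))
```

The output grid has a fixed size no matter how many points go in, which is why this approach scales to very large datasets.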

In [None]:
from datashader import Canvas, transfer_functions as tf
from colorcet import fire

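Do dropoff locations change with the time of day? Let's compare late night with morning. Note that the queries below filter on the pickup hour, so each window selects trips that started within it.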

#### Dropoffs from Midnight to 5:00 AM

In [None]:
query = '''
        select
            dropoff_x, dropoff_y
        from
            taxi
        where
            hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP)) BETWEEN 0 AND 4
        '''
nyc = Canvas().points(bc.sql(query), 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(nyc, cmap=fire), "black")

#### Dropoffs from 5:00 AM to 10:00 AM

In [None]:
query = '''
        select
            dropoff_x, dropoff_y
        from
            taxi
        where
            hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP)) BETWEEN 5 AND 9
        '''
nyc = Canvas().points(bc.sql(query), 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(nyc, cmap=fire), "black")
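
Both filters use `BETWEEN`, which includes both endpoints, so `BETWEEN 0 AND 4` covers pickups from 00:00 up to 04:59 and `BETWEEN 5 AND 9` covers 05:00 up to 09:59. A small pandas sketch of the same windowing, using a few made-up timestamps as an assumption, shows where the boundaries fall:

```python
import pandas as pd

# Made-up pickup times chosen to sit on the window boundaries.
pickups = pd.Series(pd.to_datetime([
    "2015-01-01 00:10:00", "2015-01-01 04:59:00",
    "2015-01-01 05:00:00", "2015-01-01 09:59:00",
    "2015-01-01 10:00:00",
]))
hour = pickups.dt.hour

# Series.between is inclusive on both ends by default, like SQL BETWEEN.
late_night = hour.between(0, 4)
morning = hour.between(5, 9)

print(late_night.tolist())
print(morning.tolist())
```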

### HoloViews 

[GitHub](https://github.com/holoviz/holoviews)

HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting.

By calling the `.to_pandas()` method, we can convert a `cudf.DataFrame` into a `pandas.DataFrame` and hand off to HoloViews or other CPU visualization packages.

In [None]:
from holoviews import extension, opts
import holoviews.operation.datashader as hd
from holoviews import Scatter, Bars, Dimension

extension('bokeh')
opts.defaults(opts.Bars(height=450, width=900), opts.Scatter(height=450, width=450), opts.RGB(height=450, width=450))

Does passenger count affect tip amount?

In [None]:
s = Scatter(bc.sql('select passenger_count, tip_amount from taxi').to_pandas(), 'passenger_count', 'tip_amount')

# 0-6 passengers, $0-$60 tip
ranged = s.redim.range(passenger_count=(-0.5, 6.5), tip_amount=(0, 60))

shaded = hd.spread(hd.datashade(ranged, x_sampling=0.25))

shaded.redim.label(passenger_count="Passengers", tip_amount="Tip, $")

How many riders are transported each hour?

In [None]:
riders_by_hour = '''
                 select
                     sum(passenger_count) as sum_riders,
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP)) as hours
                 from
                     taxi
                 group by
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP))
                 order by
                     hour(cast(tpep_pickup_datetime || '.0' as TIMESTAMP))
                     '''
df = bc.sql(riders_by_hour).to_pandas()

# pair each hour with its rider total
data = [(df.hours[i], df.sum_riders[i]) for i in range(len(df))]

# axis labels match the query: hour of day on x, total riders on y
Bars(data, Dimension('Hour'), 'Riders')

Does trip distance affect tip amount?

In [None]:
s = Scatter(bc.sql('select trip_distance, tip_amount from taxi').to_pandas(), 'trip_distance', 'tip_amount')

# set scope from 0 miles - 25 miles and $0 - $50
ranged = s.redim.range(trip_distance=(0, 25), tip_amount=(0, 50))

shaded = hd.spread(hd.datashade(ranged))

shaded.redim.label(trip_distance="Trip Distance", tip_amount="Tip ($)")

## That's the Data Visualization Tour!

You've seen the basics of data visualization in BlazingSQL Notebooks. Now is a good time to experiment with your own data and see how to parse, clean, and extract meaningful insights from it.

Next, we'll look at how to run machine learning with popular Python and GPU-accelerated Python packages.


[Continue to the Machine Learning introductory Notebook](machine_learning.ipynb)