# Querying Multiple Data Formats 
In this demo, we join a CSV file, a Parquet file, and a GPU DataFrame (GDF) in a single query using BlazingSQL.

In this notebook, we will cover: 
- How to set up [BlazingSQL](https://blazingsql.com) and the [RAPIDS AI](https://rapids.ai/) suite.
- How to create and then join BlazingSQL tables from CSV, Parquet, and GPU DataFrame (GDF) sources. 


## Setup
### Environment Sanity Check 

RAPIDS packages (BlazingSQL included) require an NVIDIA GPU with the Pascal architecture or newer. On Colab, this means a T4 GPU instance.

The cell below will let you know what type of GPU you've been allocated, and how to proceed.

In [1]:
!wget https://github.com/BlazingDB/bsql-demos/raw/master/utils/colab_env.py
!python colab_env.py 



***********************************
GPU = b'Tesla T4'
Woo! You got the right kind of GPU!
***********************************
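
For readers who want to see what such a check might involve, here is a minimal sketch. It is not the contents of `colab_env.py`; the helper names `query_gpu_name` and `looks_like_t4` are hypothetical, and the check assumes that a GPU name containing the token `T4` is sufficient, per the note above about Colab.

```python
import shutil
import subprocess


def looks_like_t4(gpu_name):
    """Return True if the reported GPU name contains the token 'T4'."""
    return "t4" in gpu_name.lower().split()


def query_gpu_name():
    """Ask nvidia-smi for the first GPU's name; return None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    output = result.stdout.strip()
    if result.returncode != 0 or not output:
        return None
    return output.splitlines()[0]


name = query_gpu_name()
if name is None:
    print("No NVIDIA GPU detected (nvidia-smi not found or failed).")
elif looks_like_t4(name):
    print("GPU = {}: looks like a T4.".format(name))
else:
    print("GPU = {}: not a T4.".format(name))
```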




## Installs 
The cell below pulls our Google Colab install script from the `bsql-demos` repo then runs it. The script first installs miniconda, then uses miniconda to install BlazingSQL and RAPIDS AI. This takes a few minutes to run. 

In [None]:
!wget https://github.com/BlazingDB/bsql-demos/raw/master/utils/bsql-colab.sh 
!bash bsql-colab.sh

import sys, os, time
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

import subprocess
subprocess.Popen(['blazingsql-orchestrator', '9100', '8889', '127.0.0.1', '8890'], 
                 stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
subprocess.Popen(['java', '-jar', '/usr/local/lib/blazingsql-algebra.jar', '-p', '8890'])

import pyblazing.apiv2.context as cont
cont.runRal()
time.sleep(1) 
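
The cell above launches background services and then pauses for a fixed second. As an optional alternative, the sketch below polls a TCP port until it accepts connections. The helper `wait_for_port` is hypothetical and not part of BlazingSQL, and the assumption that a service is "ready" once its port accepts a connection may not hold for every process.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Returns True if a connection was established, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing listening yet (or connection refused); retry shortly.
            time.sleep(interval)
    return False
```

For instance, `wait_for_port('127.0.0.1', 8890)` could be called after the `subprocess.Popen` lines, since `8890` is the port passed to the algebra jar with `-p` in the cell above.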

## Grab Data

The data is in our public S3 bucket, so we can use `wget` to grab it.

In [None]:
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_00.csv'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_02.csv'
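
Later cells read these files from `/content/`, so it can help to confirm that all three downloads landed there before creating tables. The sketch below is one way to do that; `missing_files` is a hypothetical helper, and `/content` is assumed to be the directory `wget` saved into.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = [
    "cancer_data_00.csv",
    "cancer_data_01.parquet",
    "cancer_data_02.csv",
]


def missing_files(directory, filenames=EXPECTED_FILES):
    """Return the expected file names that are not present in directory."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


missing = missing_files("/content")
print("missing:", missing if missing else "none")
```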

## Import packages and create Blazing Context
You can think of the BlazingContext much like a Spark Context: it is where information such as the FileSystems you have registered and the Tables you have created is stored. If you have issues running this cell, restart the runtime and try running it again.

In [1]:
from blazingsql import BlazingContext
# start up BlazingSQL
bc = BlazingContext()

BlazingContext ready


### Create Table from CSV
Here we create a BlazingSQL table directly from a comma-separated values (CSV) file.

In [18]:
# define column names and types
column_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
column_types = ['float32', 'float32', 'float32', 'float32']

# create table from CSV file
bc.create_table('data_00', '/content/cancer_data_00.csv', dtype=column_types, names=column_names)

<pyblazing.apiv2.sql.Table at 0x7f9aa84ba710>
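
The `names` and `dtype` arguments are passed as two parallel lists, so a mismatch in length is easy to introduce. The sketch below pairs them up and fails early if the lengths differ. `pair_schema` is a hypothetical helper, not a BlazingSQL function, and it assumes the two lists are matched by position, as the parallel lists suggest.

```python
def pair_schema(names, types):
    """Pair column names with dtypes by position; raise if lengths differ."""
    if len(names) != len(types):
        raise ValueError(
            "got {} column names but {} column types".format(len(names), len(types))
        )
    return dict(zip(names, types))


# Same lists as in the CSV cell above.
column_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
column_types = ['float32', 'float32', 'float32', 'float32']

schema = pair_schema(column_names, column_types)
for name, dtype in schema.items():
    print(name, dtype)
```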

### Create Table from Parquet
Here we create a BlazingSQL table directly from an Apache Parquet file.

In [19]:
# create table from Parquet file
bc.create_table('data_01', '/content/cancer_data_01.parquet')

<pyblazing.apiv2.sql.Table at 0x7f9aa85487f0>

### Create Table from GPU DataFrame
Here we use cuDF to create a GPU DataFrame (GDF), then use BlazingSQL to create a table from that GDF.

The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [23]:
import cudf

# define column names and types
column_names = ['compactness', 'symmetry', 'fractal_dimension']
column_types = ['float32', 'float32', 'float32']

# make GDF with cuDF
gdf_02 = cudf.read_csv('/content/cancer_data_02.csv', dtype=column_types, names=column_names)

# create BlazingSQL table from GDF
bc.create_table('data_02', gdf_02)

<pyblazing.apiv2.sql.Table at 0x7f9aa84ce4e0>

# Join Tables Together 

Now we can use BlazingSQL to join all three data formats in a single federated query. 

In [21]:
# define a query
sql = '''
        SELECT 
            a.*, 
            b.area, b.smoothness, 
            c.* 
        FROM data_00 as a
            LEFT JOIN data_01 as b
                ON (a.perimeter = b.perimeter)
            LEFT JOIN data_02 as c
                ON (b.compactness = c.compactness)
                '''

# join the tables together
gdf = bc.sql(sql)

# display results
gdf

| | diagnosis_result | radius | texture | perimeter | area | smoothness | compactness | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 22.0 | 14.0 | 78.0 | 386.0 | 0.070 | 0.284000009 | 0.25999999 | 0.097000003 |
| 1 | 0.0 | 25.0 | 11.0 | 87.0 | 545.0 | 0.104 | 0.143999994 | 0.196999997 | 0.067999996 |
| 2 | 0.0 | 19.0 | 22.0 | 87.0 | 545.0 | 0.104 | 0.143999994 | 0.196999997 | 0.067999996 |
| 3 | 0.0 | 19.0 | 27.0 | 62.0 | 295.0 | 0.102 | 0.052999999 | 0.13499999 | 0.068999998 |
| 4 | 1.0 | 19.0 | 27.0 | 72.0 | 371.0 | 0.123 | 0.122000001 | 0.189999998 | 0.068999998 |
| 5 | 0.0 | 22.0 | 14.0 | 78.0 | 451.0 | 0.105 | 0.071000002 | 0.189999998 | 0.066 |
| 6 | 1.0 | 15.0 | 21.0 | 87.0 | 545.0 | 0.104 | 0.143999994 | 0.196999997 | 0.067999996 |
| 7 | 0.0 | 21.0 | 24.0 | 74.0 | 413.0 | 0.090 | 0.075000003 | 0.162 | 0.066 |
| 8 | 1.0 | 17.0 | 20.0 | 96.0 | 699.0 | 0.094 | 0.050999999 | 0.138999999 | 0.052999999 |
| 9 | 0.0 | 23.0 | 26.0 | 54.0 | 225.0 | 0.098 | 0.052999999 | 0.13499999 | 0.068999998 |
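
To see how the chained `LEFT JOIN`s shape the result without needing a GPU, the sketch below runs the same query against Python's built-in `sqlite3` module. SQLite stands in for BlazingSQL here, every row is invented for illustration, and the column layout of `data_01` is assumed from the columns the query references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_00 (diagnosis_result REAL, radius REAL, texture REAL, perimeter REAL);
    CREATE TABLE data_01 (perimeter REAL, area REAL, smoothness REAL, compactness REAL);
    CREATE TABLE data_02 (compactness REAL, symmetry REAL, fractal_dimension REAL);
""")

# Invented sample rows (not taken from the cancer data files).
conn.executemany("INSERT INTO data_00 VALUES (?, ?, ?, ?)", [
    (0.0, 25.0, 11.0, 87.0),
    (0.0, 19.0, 22.0, 87.0),
    (0.0, 22.0, 14.0, 78.0),   # perimeter 78 matches two rows in data_01
    (1.0, 10.0, 10.0, 50.0),   # perimeter 50 matches nothing in data_01
])
conn.executemany("INSERT INTO data_01 VALUES (?, ?, ?, ?)", [
    (87.0, 545.0, 0.104, 0.144),
    (78.0, 386.0, 0.070, 0.284),
    (78.0, 451.0, 0.105, 0.071),
])
conn.executemany("INSERT INTO data_02 VALUES (?, ?, ?)", [
    (0.144, 0.197, 0.068),
    (0.284, 0.260, 0.097),
    (0.071, 0.190, 0.066),
])

sql = '''
        SELECT
            a.*,
            b.area, b.smoothness,
            c.*
        FROM data_00 AS a
            LEFT JOIN data_01 AS b
                ON (a.perimeter = b.perimeter)
            LEFT JOIN data_02 AS c
                ON (b.compactness = c.compactness)
      '''

rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

Because a `LEFT JOIN` keeps every row from the left table and emits one output row per match, a `data_00` row whose `perimeter` matches several `data_01` rows appears several times, and an unmatched row is kept with `NULL` (`None` in Python) in the joined columns. The same effect is visible in the output above, where rows 0 and 5 share their `data_00` values but differ in `area`.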


# You're Ready to Rock
And... that's it! You are now live with BlazingSQL.

Check out our [docs](https://docs.blazingdb.com) to get fancy or to learn more about how BlazingSQL works with the rest of [RAPIDS AI](https://rapids.ai/).