# Querying Multiple Data Formats

In this notebook, we will cover:
- How to set up [BlazingSQL](https://blazingsql.com) and the [RAPIDS AI](https://rapids.ai/) suite.
- How to join a CSV file, a Parquet file, and a GPU DataFrame (GDF) in a single query using BlazingSQL.


## Setup
### Environment Sanity Check 

RAPIDS packages (BlazingSQL included) require a GPU with the Pascal architecture or newer. For Colab, this translates to a T4 GPU instance.

The cell below will let you know what type of GPU you've been allocated, and how to proceed.

In [3]:
# capture the nvidia-smi report
colab_smi = !nvidia-smi

# extract the GPU type from the report
try:
    my_gpu = ' '.join(colab_smi[7].split()[2:4])
# report is too short: no GPU attached to this runtime
except IndexError:
    raise Exception("\nPlease make sure you've configured Colab to request a GPU instance type.\n\n"
                    "At top of Colab, try: Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save\n")

# not allocated a compatible GPU
if (my_gpu != 'Tesla T4') and (my_gpu != 'Tesla P100-PCIE...') and (my_gpu != 'GeForce GTX'):
    # allocated K80
    if my_gpu == 'Tesla K80':
        raise Exception("\nYou've been allocated a K80 instance\n\n"
                    "Unfortunately, this demo requires a T4 instance\n\n"
                    "At top of Colab, try: Runtime -> Reset all runtimes...\n")
    else:
        raise Exception(f"\nYou've achieved wizardry.\nYour GPU is {my_gpu}\nPlease inform info@blazingsql.com")

# allocated a compatible GPU
else:
    print('Woo! You got the right kind of GPU!')

Woo! You got the right kind of GPU!
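
If you would rather not depend on a fixed line of the `nvidia-smi` report, here is a minimal sketch of the same check that asks `nvidia-smi` for the GPU name directly. The helper names `query_gpu_name` and `is_supported` are our own (not part of any library), and the accepted prefixes are simply the three names the cell above allows. The exact name string may differ between driver versions, so treat the prefixes as an assumption.

```python
import subprocess

# GPU name prefixes treated as compatible (assumption: mirrors the check in the cell above)
SUPPORTED_PREFIXES = ("Tesla T4", "Tesla P100", "GeForce GTX")


def query_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if no GPU is visible."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed: no usable GPU on this runtime
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def is_supported(gpu_name):
    """True when the GPU name starts with one of the supported prefixes."""
    return gpu_name is not None and gpu_name.startswith(SUPPORTED_PREFIXES)


if __name__ == "__main__":
    name = query_gpu_name()
    print(f"GPU: {name!r}, supported: {is_supported(name)}")
```

Matching on a prefix avoids the truncated `Tesla P100-PCIE...` string that appears in the tabular report.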


## Installs 

Below you will find three code blocks:
1. The first installs miniconda.
2. The second installs RAPIDS AI and sets up the system environment. 
3. The third installs BlazingSQL.

### Miniconda

In [5]:
# install miniconda
!wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

### RAPIDS AI

In [4]:
# install RAPIDS packages
!conda install -q -y --prefix /usr/local -c nvidia -c rapidsai \
  -c numba -c conda-forge -c pytorch -c defaults \
  cudf=0.9 cuml=0.9 cugraph=0.9 python=3.6 cudatoolkit=10.0

# set environment vars
import sys, os, shutil
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# copy .so files to current working dir
for fn in ['libcudf.so', 'librmm.so']:
    shutil.copy('/usr/local/lib/'+fn, os.getcwd())

### BlazingSQL

In [None]:
# Install BlazingSQL for CUDA 10.0
! conda install -q -y --prefix /usr/local -c conda-forge -c defaults -c nvidia -c rapidsai \
   -c blazingsql/label/cuda10.0 -c blazingsql \
   blazingsql-calcite blazingsql-orchestrator blazingsql-ral blazingsql-python

!pip install flatbuffers

## Import packages and create Blazing Context
You can think of the BlazingContext much like a Spark Context: it stores information such as the FileSystems you have registered and the Tables you have created. If you have issues running this cell, restart the runtime and try running it again.

In [1]:
# Import RAPIDS AI stack
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

BlazingContext ready


# Grab Data

The data is in our public S3 bucket, so we can use wget to grab it.

In [2]:
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_00.csv'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_02.csv'

--2019-10-20 18:21:56--  https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_00.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.168.235
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.168.235|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1233 (1.2K) [text/csv]
Saving to: ‘cancer_data_00.csv’


2019-10-20 18:21:57 (59.1 MB/s) - ‘cancer_data_00.csv’ saved [1233/1233]

--2019-10-20 18:21:57--  https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.168.235
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.168.235|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2364 (2.3K) [binary/octet-stream]
Saving to: ‘cancer_data_01.parquet’


2019-10-20 18:21:57 (2.27 MB/s) - ‘cancer_data_01.parqu
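
If `wget` is not available, the same three files can be fetched from Python. This is a minimal sketch using only the standard library; the helper names `local_path` and `fetch` are our own. It returns absolute paths, which is the form of local path used later in this notebook when reading `cancer_data_02.csv`.

```python
import os
import urllib.request

# Same bucket and file names as the wget commands above
BASE_URL = "https://blazingsql-colab.s3.amazonaws.com/cancer_data/"
FILES = ["cancer_data_00.csv", "cancer_data_01.parquet", "cancer_data_02.csv"]


def local_path(file_name, dest_dir="."):
    """Absolute path where file_name will live inside dest_dir."""
    return os.path.abspath(os.path.join(dest_dir, file_name))


def fetch(file_name, dest_dir="."):
    """Download file_name from the bucket unless it already exists; return its absolute path."""
    target = local_path(file_name, dest_dir)
    if not os.path.exists(target):
        urllib.request.urlretrieve(BASE_URL + file_name, target)
    return target


# Usage (performs network requests, so it is left commented out here):
# paths = [fetch(name) for name in FILES]
```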

# Create Table from CSV


In [2]:
column_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
column_types = ['float32', 'float32', 'float32', 'float32']

# BlazingSQL expects local files as absolute paths ('/folder0/folder1/fileName.extension');
# this assumes wget saved the file to Colab's default working directory, /content
bc.create_table('data_00', '/content/cancer_data_00.csv', delimiter=',', dtype=column_types, names=column_names)


<pyblazing.apiv2.sql.Table at 0x7f7bfc69eda0>

# Create Table from Parquet



In [7]:
# BlazingSQL expects local files as absolute paths ('/folder0/folder1/fileName.extension');
# this assumes wget saved the file to Colab's default working directory, /content
bc.create_table('data_01', '/content/cancer_data_01.parquet')


<pyblazing.apiv2.sql.Table at 0x7fc688061e48>

# Create Table from GPU DataFrame (GDF)
Here we use cuDF to create a GDF, then use BlazingSQL to create a table from that GDF. The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [None]:
column_names = ['compactness', 'symmetry', 'fractal_dimension']
column_types = ['float32', 'float32', 'float32']

gdf_02 = cudf.read_csv('/content/cancer_data_02.csv', delimiter=',', dtype=column_types, names=column_names)

bc.create_table('data_02', gdf_02)
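
Because the `names` and `dtype` lists are paired by position, it can help to check that they line up before handing them to a reader. Here is a small sketch of such a check; `check_schema` is a hypothetical helper of our own, not part of cuDF or BlazingSQL.

```python
def check_schema(names, dtypes):
    """Pair column names with dtypes, raising if the two lists differ in length."""
    if len(names) != len(dtypes):
        raise ValueError(
            f"{len(names)} column names but {len(dtypes)} dtypes: "
            "each column needs exactly one dtype"
        )
    return dict(zip(names, dtypes))


if __name__ == "__main__":
    schema = check_schema(
        ['compactness', 'symmetry', 'fractal_dimension'],
        ['float32', 'float32', 'float32'],
    )
    print(schema)
```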

# Join Tables Together 

Now we can use BlazingSQL to join all three data formats in a single federated query. 

In [None]:
sql = '''
SELECT a.*, b.area, b.smoothness, c.* FROM main.data_00 AS a
LEFT JOIN main.data_01 AS b
ON (a.perimeter = b.perimeter)
LEFT JOIN main.data_02 AS c
ON (b.compactness = c.compactness)
'''
join = bc.sql(sql).get()
result = join.columns

print(result)

     diagnosis_result  radius  ...     symmetry  fractal_dimension
0                 0.0    19.0  ...  0.196999997        0.067999996
1                 1.0    14.0  ...  0.215000004        0.067000002
2                 1.0    18.0  ...        0.162              0.057
3                 0.0    22.0  ...        0.147              0.059
4                 0.0    21.0  ...        0.147              0.059
5                 1.0    15.0  ...  0.181000009              0.057
6                 0.0    25.0  ...  0.181000009              0.057
7                 0.0    19.0  ...  0.181000009              0.057
8                 0.0    22.0  ...        0.147              0.059
9                 0.0    21.0  ...  0.172000006        0.063999996
10                1.0    24.0  ...  0.193999991        0.068999998
11                1.0    18.0  ...  0.158000007        0.054000001
12                1.0    20.0  ...  0.193999991        0.068999998
13                0.0    22.0  ...  0.172000006        0.06399
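
To see what the chained `LEFT JOIN`s do without a GPU, here is a toy sketch that runs the same query shape in SQLite (Python standard library). Everything in it is illustrative: the rows are made up, and the column layout of `data_01` is an assumption based only on the columns the query references. SQLite's default schema happens to be called `main`, so the table references carry over unchanged.

```python
import sqlite3


def run_toy_join():
    """Run the notebook's join on tiny made-up tables and return the result rows."""
    con = sqlite3.connect(":memory:")
    # Illustrative data only; these are not rows from the cancer data files
    con.executescript("""
        CREATE TABLE data_00 (diagnosis_result REAL, radius REAL, texture REAL, perimeter REAL);
        CREATE TABLE data_01 (perimeter REAL, area REAL, smoothness REAL, compactness REAL);
        CREATE TABLE data_02 (compactness REAL, symmetry REAL, fractal_dimension REAL);
        INSERT INTO data_00 VALUES (0.0, 19.0, 27.0, 143.0), (1.0, 14.0, 16.0, 78.0), (1.0, 18.0, 10.0, 99.0);
        INSERT INTO data_01 VALUES (143.0, 1245.0, 0.129, 0.345), (78.0, 449.0, 0.103, 0.091);
        INSERT INTO data_02 VALUES (0.345, 0.197, 0.068);
    """)
    sql = '''
    SELECT a.*, b.area, b.smoothness, c.* FROM main.data_00 AS a
    LEFT JOIN main.data_01 AS b
    ON (a.perimeter = b.perimeter)
    LEFT JOIN main.data_02 AS c
    ON (b.compactness = c.compactness)
    '''
    rows = con.execute(sql).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in run_toy_join():
        print(row)
```

Every row of `data_00` survives the join. Where `data_01` or `data_02` has no matching row, the columns taken from that table come back as NULL (`None` in Python).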

*And*... that's it! Check out our [docs](https://docs.blazingdb.com) to get fancy and to learn more about how BlazingSQL works with the rest of [RAPIDS AI](https://rapids.ai/).