# Querying Multiple Data Formats

In this demo we will join a CSV file, a Parquet file, and a GPU DataFrame (GDF) in a single query using BlazingSQL.

# Setup

## Environment Sanity Check 

Click the Runtime dropdown at the top of the page, then Change Runtime Type and confirm the instance type is GPU.

Check the output of !nvidia-smi to make sure you've been allocated a Tesla T4.


In [0]:
!nvidia-smi

Fri Aug 23 21:04:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru
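If you would rather check the GPU from Python than read the table by eye, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=name --format=csv,noheader` options; the helper names `parse_gpu_names` and `list_gpus` are ours and not part of any library.

```python
import shutil
import subprocess


def parse_gpu_names(query_output):
    """Turn the text printed by nvidia-smi's name query into a list of GPU names."""
    return [line.strip() for line in query_output.splitlines() if line.strip()]


def list_gpus():
    """Return the names of visible NVIDIA GPUs, or an empty list if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_names(result.stdout)


gpus = list_gpus()
if any("T4" in name for name in gpus):
    print("Tesla T4 detected:", gpus)
else:
    print("No Tesla T4 found; visible GPUs:", gpus)
```

If no T4 shows up, resetting the runtime and checking again is usually the next thing to try.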

## Install BlazingSQL + cuDF
(this might take a minute)

In [0]:
%%writefile bsql-colab.sh

#!/bin/bash


set -eu

wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/env-check.py
echo "Checking for GPU type:"
python env-check.py

if [ ! -f Miniconda3-4.5.4-Linux-x86_64.sh ]; then
    echo "Removing conflicting packages, will replace with RAPIDS compatible versions"
    # remove existing xgboost and dask installs
    pip uninstall -y xgboost dask distributed

    # install miniconda
    echo "Installing conda"
    wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
    chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
    bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local
    
    echo "Installing RAPIDS packages"
    echo "Please standby, this will take a few minutes..."
    # install RAPIDS packages
    conda install -y --prefix /usr/local \
      -c rapidsai/label/xgboost -c rapidsai -c nvidia -c conda-forge \
      python=3.6 cudatoolkit=10.0 \
      cudf=0.9.* cuml=0.9.* cugraph=0.9.* gcsfs pynvml \
      dask-cudf=0.9.* \
      "rapidsai/label/xgboost::xgboost>=0.9"
      
    echo "Copying shared object files to /usr/lib"
    # copy .so files to /usr/lib, where Colab's Python looks for libs
    cp /usr/local/lib/libcudf.so /usr/lib/libcudf.so
    cp /usr/local/lib/librmm.so /usr/lib/librmm.so
    cp /usr/local/lib/libxgboost.so /usr/lib/libxgboost.so
    cp /usr/local/lib/libnccl.so /usr/lib/libnccl.so
    # install BlazingSQL packages
    conda install -y --prefix /usr/local \
      -c rapidsai -c nvidia -c conda-forge -c felipeblazing/label/cuda10.0 \
      python=3.6 cudatoolkit=10.0 \
      blazingsql-ral blazingsql-orchestrator blazingsql-calcite blazingsql-python
    pip install flatbuffers
fi
echo ""
echo "*********************************************"
echo "Your Google Colab instance is RAPIDS ready!"
echo "*********************************************"



echo ""
echo "*********************************************"
echo "Your Google Colab instance is BlazingSQL ready!"
echo "*********************************************"


Writing bsql-colab.sh


In [0]:
%%time
!bash bsql-colab.sh

--2019-08-23 21:04:20--  https://github.com/rapidsai/notebooks-extended/raw/master/utils/env-check.py
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/rapidsai/notebooks-contrib/raw/master/utils/env-check.py [following]
--2019-08-23 21:04:21--  https://github.com/rapidsai/notebooks-contrib/raw/master/utils/env-check.py
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/env-check.py [following]
--2019-08-23 21:04:21--  https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/env-check.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.13

In [0]:
import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# We are working on removing the need for these calls by starting the services from BlazingContext itself.
import subprocess

# Launch the orchestrator service in the background
subprocess.Popen(['blazingsql-orchestrator', '9100', '8889', '127.0.0.1', '8890'],
                 stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# Launch the algebra service from blazingsql-algebra.jar in the background
subprocess.Popen(['java', '-jar', '/usr/local/lib/blazingsql-algebra.jar', '-p', '8890'])

# Start the RAL process
import pyblazing.apiv2.context as cont
cont.runRal()

# You are ready to go with BlazingSQL!
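`subprocess.Popen` returns immediately, so the background services may still be starting when the next cell runs. A minimal sketch of a readiness check follows, using only the standard library. The helper `wait_for_port` is hypothetical, and treating 9100 and 8890 as the ports the services listen on is an assumption read off the launch arguments above.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (port numbers are an assumption, see the note above):
# for port in (9100, 8890):
#     print(port, "ready" if wait_for_port("127.0.0.1", port) else "not responding")
```

A `False` result only tells you that nothing accepted a connection in time; it does not say why the service failed to start.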


# Import packages and create Blazing Context
You can think of the BlazingContext much like a SparkContext: it stores information such as the file systems you have registered and the tables you have created.

If you have trouble running this cell, please reset the runtime in the menu above, and then try running it again. 


In [0]:
# Import RAPIDS AI stack
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

connection established


# Grab Data

The data is in our public S3 bucket, so we can use wget to grab it.

In [0]:
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_00.csv'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet'
!wget 'https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_02.csv'

--2019-08-23 21:14:20--  https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_00.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.170.3
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.170.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1233 (1.2K) [text/csv]
Saving to: ‘cancer_data_00.csv’


2019-08-23 21:14:20 (74.7 MB/s) - ‘cancer_data_00.csv’ saved [1233/1233]

--2019-08-23 21:14:22--  https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.170.3
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.170.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2364 (2.3K) [binary/octet-stream]
Saving to: ‘cancer_data_01.parquet’


2019-08-23 21:14:23 (175 MB/s) - ‘cancer_data_01.parquet’ saved
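Outside of Colab, or if you want to avoid downloading the files again on every run, the same three downloads can be sketched in plain Python. The URLs are the ones used above; the helpers `local_name` and `download_missing` are hypothetical names of ours, and the `fetch` parameter exists only so the function can be exercised without network access.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

BASE_URL = "https://blazingsql-colab.s3.amazonaws.com/cancer_data/"
FILES = ["cancer_data_00.csv", "cancer_data_01.parquet", "cancer_data_02.csv"]


def local_name(url):
    """Derive the local file name from the last path component of the URL."""
    return os.path.basename(urlparse(url).path)


def download_missing(urls, dest_dir=".", fetch=urlretrieve):
    """Download each URL into dest_dir unless the file is already there.

    Returns the list of local paths, in the same order as urls.
    """
    paths = []
    for url in urls:
        path = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(path):
            fetch(url, path)
        paths.append(path)
    return paths


# Usage in the notebook (requires network access):
# download_missing([BASE_URL + name for name in FILES], dest_dir="/content")
```

Note that this only checks whether a file exists, not whether an earlier download completed, so delete any partial file before retrying.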

# Create Table from CSV


In [0]:
column_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
column_types = ['float32', 'float32', 'float32', 'float32']

bc.create_table('data_00', '/content/cancer_data_00.csv', delimiter=',', dtype=column_types, names=column_names)
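Because the column names and dtypes are passed as two separate lists, it is easy for them to drift out of sync with each other or with the file. Here is a minimal sketch of a pre-flight check using only the standard library; `build_schema`, `first_row_width`, and `check_csv_against_schema` are hypothetical helpers of ours, not BlazingSQL functions, and the check assumes the CSV has no header row and the same number of fields on every line.

```python
import csv


def build_schema(names, types):
    """Pair each column name with its dtype; refuse lists of different lengths."""
    if len(names) != len(types):
        raise ValueError(
            "got %d column names but %d dtypes" % (len(names), len(types))
        )
    return dict(zip(names, types))


def first_row_width(path, delimiter=","):
    """Return the number of fields in the first non-empty row of a delimited file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if row:
                return len(row)
    return 0


def check_csv_against_schema(path, names, types, delimiter=","):
    """Raise ValueError if the file's first row does not match the declared schema."""
    schema = build_schema(names, types)
    width = first_row_width(path, delimiter)
    if width != len(schema):
        raise ValueError(
            "%s has %d fields per row but %d columns were declared"
            % (path, width, len(schema))
        )
    return schema
```

Calling `check_csv_against_schema('/content/cancer_data_00.csv', column_names, column_types)` before `bc.create_table` would surface a mismatch as a clear Python error instead of a failure later in the query.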


# Create Table from Parquet



In [0]:
bc.create_table('data_01', '/content/cancer_data_01.parquet')
       

# Create Table from GPU DataFrame (GDF)
Here we use cuDF to create a GDF, then use BlazingSQL to create a table from that GDF. The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [0]:
column_names = ['compactness', 'symmetry', 'fractal_dimension']
column_types = ['float32', 'float32', 'float32']

gdf_02 = cudf.read_csv('/content/cancer_data_02.csv', delimiter=',', dtype=column_types, names=column_names)

bc.create_table('data_02', gdf_02)

# Join Tables Together 

Now we can use BlazingSQL to join all three data formats in a single federated query. 

In [0]:
sql = '''
SELECT a.*, b.area, b.smoothness, c.*
FROM main.data_00 AS a
LEFT JOIN main.data_01 AS b
ON (a.perimeter = b.perimeter)
LEFT JOIN main.data_02 AS c
ON (b.compactness = c.compactness)
'''
join = bc.sql(sql).get()
result = join.columns

print(result)

     diagnosis_result  radius  ...     symmetry  fractal_dimension
0                 0.0    19.0  ...  0.196999997        0.067999996
1                 1.0    14.0  ...  0.215000004        0.067000002
2                 1.0    18.0  ...        0.162              0.057
3                 0.0    22.0  ...        0.147              0.059
4                 0.0    21.0  ...        0.147              0.059
5                 1.0    15.0  ...  0.181000009              0.057
6                 0.0    25.0  ...  0.181000009              0.057
7                 0.0    19.0  ...  0.181000009              0.057
8                 0.0    22.0  ...        0.147              0.059
9                 0.0    21.0  ...  0.172000006        0.063999996
10                1.0    24.0  ...  0.193999991        0.068999998
11                1.0    18.0  ...  0.158000007        0.054000001
12                1.0    20.0  ...  0.193999991        0.068999998
13                0.0    22.0  ...  0.172000006        0.06399
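To see what the two chained LEFT JOINs do to rows that have no match, here is a small sketch that runs the same query shape on Python's built-in `sqlite3` rather than on BlazingSQL. The table layouts mirror the columns the query references (the real `data_01` file may contain more), and every value is a toy number invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_00 (diagnosis_result REAL, radius REAL, texture REAL, perimeter REAL);
CREATE TABLE data_01 (perimeter REAL, area REAL, smoothness REAL, compactness REAL);
CREATE TABLE data_02 (compactness REAL, symmetry REAL, fractal_dimension REAL);

-- Toy values, invented for illustration only.
INSERT INTO data_00 VALUES (0, 10, 1, 100), (1, 20, 2, 200);
INSERT INTO data_01 VALUES (100, 500, 0.5, 0.25);
INSERT INTO data_02 VALUES (0.25, 0.75, 0.125);
""")

# Same join structure as the BlazingSQL query above.
rows = conn.execute("""
SELECT a.*, b.area, b.smoothness, c.*
FROM data_00 AS a
LEFT JOIN data_01 AS b ON a.perimeter = b.perimeter
LEFT JOIN data_02 AS c ON b.compactness = c.compactness
ORDER BY a.perimeter
""").fetchall()

for row in rows:
    print(row)
```

Every row of `data_00` survives the join. The row with perimeter 200 has no partner in `data_01`, so its `b` columns are NULL (`None` in Python), and because `b.compactness` is NULL it cannot match anything in `data_02` either, leaving the `c` columns NULL as well.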

*And*... that's it! Check out our [docs](https://docs.blazingdb.com) to get fancy and to learn more about how BlazingSQL works with the rest of [RAPIDS AI](https://rapids.ai/).