# Install BlazingSQL + RAPIDS AI

In [0]:
!nvidia-smi

Fri Aug 23 21:33:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
%%writefile bsql-colab.sh

#!/bin/bash


set -eu

wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/env-check.py
echo "Checking for GPU type:"
python env-check.py

if [ ! -f Miniconda3-4.5.4-Linux-x86_64.sh ]; then
    echo "Removing conflicting packages, will replace with RAPIDS compatible versions"
    # remove existing xgboost and dask installs
    pip uninstall -y xgboost dask distributed

    # install miniconda
    echo "Installing conda"
    wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
    chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
    bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local
    
    echo "Installing RAPIDS packages"
    echo "Please stand by, this will take a few minutes..."
    # install RAPIDS packages
    conda install -y --prefix /usr/local \
      -c rapidsai/label/xgboost -c rapidsai -c nvidia -c conda-forge \
      python=3.6 cudatoolkit=10.0 \
      cudf=0.9.* cuml=0.9.* cugraph=0.9.* gcsfs pynvml \
      dask-cudf=0.9.* \
      "rapidsai/label/xgboost::xgboost>=0.9"
      
    echo "Copying shared object files to /usr/lib"
    # copy .so files to /usr/lib, where Colab's Python looks for libs
    cp /usr/local/lib/libcudf.so /usr/lib/libcudf.so
    cp /usr/local/lib/librmm.so /usr/lib/librmm.so
    cp /usr/local/lib/libxgboost.so /usr/lib/libxgboost.so
    cp /usr/local/lib/libnccl.so /usr/lib/libnccl.so
    conda install -y --prefix /usr/local -c rapidsai -c nvidia -c conda-forge -c felipeblazing/label/cuda10.0 python=3.6 cudatoolkit=10.0 blazingsql-ral blazingsql-orchestrator blazingsql-calcite blazingsql-python
    pip install flatbuffers
fi
echo ""
echo "*********************************************"
echo "Your Google Colab instance is RAPIDS ready!"
echo "*********************************************"



echo ""
echo "*********************************************"
echo "Your Google Colab instance is BlazingSQL ready!"
echo "*********************************************"


Writing bsql-colab.sh


In [0]:
%%time
!bash bsql-colab.sh

--2019-08-23 21:33:23--  https://github.com/rapidsai/notebooks-extended/raw/master/utils/env-check.py
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/rapidsai/notebooks-contrib/raw/master/utils/env-check.py [following]
--2019-08-23 21:33:23--  https://github.com/rapidsai/notebooks-contrib/raw/master/utils/env-check.py
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/env-check.py [following]
--2019-08-23 21:33:24--  https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/env-check.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.

# You are ready to go with BlazingSQL!
Nice job! Now let's see how it works.

# Download CSV

You will need to download the CSV we are going to use for this demo.

In [0]:
!wget https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv

--2019-08-23 21:43:50--  https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.137.76
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.137.76|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2725056295 (2.5G) [text/csv]
Saving to: ‘nf-chunk2.csv’


2019-08-23 21:44:46 (46.2 MB/s) - ‘nf-chunk2.csv’ saved [2725056295/2725056295]



# Load & Query Tables

Here we import cuDF and BlazingSQL, load the CSV file into a GPU DataFrame (GDF), and then create a table so that we can run SQL queries on that GDF.

Note that creating a table from a GDF does not copy the data; it only registers the schema.
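
The idea of registering a table without copying can be pictured with plain Python objects. This is only a conceptual sketch: `TableRegistry` and the one-row `frame` are hypothetical stand-ins, not cuDF or BlazingSQL classes, and it does not describe how BlazingSQL is implemented internally.

```python
# Conceptual stand-in only: plain Python objects, not cuDF or BlazingSQL.
class TableRegistry:
    """Maps table names to existing frames without copying their data."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, frame):
        # Record the column names (the schema) and keep a reference to the same object.
        self.tables[name] = {"schema": list(frame.keys()), "frame": frame}


# Hypothetical one-row frame, stored as a dict of columns.
frame = {"firstSeenSrcIp": ["172.30.2.60"], "attemptCount": [82]}

registry = TableRegistry()
registry.create_table("netflow", frame)

# The registered table and the original frame are the same object in memory.
same_object = registry.tables["netflow"]["frame"] is frame
print(same_object)
```

Because only a reference is stored, registering a table costs the same regardless of how many rows the frame holds.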

In [0]:
# Set Environment Variables

import os
import sys
import time

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# Stand up the BlazingSQL services. We are working on removing the need to call these functions and initializing them in BlazingContext instead.
import subprocess
subprocess.Popen(['blazingsql-orchestrator', '9100', '8889', '127.0.0.1', '8890'],stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
subprocess.Popen(['java', '-jar', '/usr/local/lib/blazingsql-algebra.jar', '-p', '8890'])
import pyblazing.apiv2.context as cont
cont.runRal()
time.sleep(1) #Wait for service to start.
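
A fixed one-second sleep may be too short on a slow instance. One possible alternative, sketched below, is to poll a TCP port until it accepts connections. The helper `wait_for_port` is hypothetical and not part of BlazingSQL, and the assumption that the service on port 8890 (the value passed to the algebra jar with `-p`) accepts plain TCP connections once it is up has not been verified here.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.25):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and retry.
            time.sleep(interval)
    return False


# Possible usage in place of the fixed sleep (port taken from the cell above):
# if not wait_for_port("127.0.0.1", 8890):
#     raise RuntimeError("service on port 8890 did not start in time")
```

Polling returns as soon as the port opens, so it is no slower than the fixed sleep in the common case and reports a clear failure otherwise.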

In [0]:
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

connection established
CPU times: user 1.91 ms, sys: 13.2 ms, total: 15.1 ms
Wall time: 38 ms


In [0]:
# Load the CSV into a GPU DataFrame (GDF)
netflow_gdf = cudf.read_csv('/content/nf-chunk2.csv')

# Create the BlazingSQL table; no data is copied in this step
bc.create_table('netflow', netflow_gdf)

From there we can simply run a SQL query.

In this example, we take millions of rows of netflow (network flow) data and search them for anomalous activity within a network.

We run a grouped aggregation to condense these millions of rows into thousands of rows that represent nodes and edges.

In [0]:
%%time
sql = '''
SELECT
  a.firstSeenSrcIp as source,
  a.firstSeenDestIp as destination,
  COUNT(a.firstSeenDestPort) as targetPorts,
  SUM(a.firstSeenSrcTotalBytes) as bytesOut,
  SUM(a.firstSeenDestTotalBytes) as bytesIn,
  SUM(a.durationSeconds) as durationSeconds,
  MIN(parsedDate) as firstFlowDate,
  MAX(parsedDate) as lastFlowDate,
  COUNT(*) as attemptCount
  FROM
  main.netflow a
  GROUP BY
  a.firstSeenSrcIp,
  a.firstSeenDestIp
  '''

result = bc.sql(sql).get()
result_gdf = result.columns
df = result_gdf.to_pandas()
print(df.head(10))

  firstSeenSrcIp firstSeenDestIp  ...         lastFlowDate  attemptCount
0    172.30.2.60        10.0.0.9  ...  2013-04-03 12:12:37            82
1   172.10.1.162       10.0.0.11  ...  2013-04-03 14:58:35            87
2   172.10.1.234        10.0.0.5  ...  2013-04-03 15:11:07           104
3      10.1.0.76     172.10.1.82  ...  2013-04-03 09:55:05             1
4    172.10.1.89        10.0.0.5  ...  2013-04-03 15:17:39           112
5   172.30.1.201       172.0.0.1  ...  2013-04-03 23:06:00            29
6   172.10.1.106    10.199.250.2  ...  2013-04-03 10:12:35            40
7    172.30.1.10       10.0.0.12  ...  2013-04-03 12:11:40            69
8    172.20.1.58        10.7.5.5  ...  2013-04-03 11:20:09            49
9   172.30.2.125        10.0.0.9  ...  2013-04-03 12:12:37            69

[10 rows x 9 columns]
CPU times: user 59.7 ms, sys: 10.1 ms, total: 69.7 ms
Wall time: 2.28 s
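
To see what the aggregation computes without a GPU, the same GROUP BY can be tried on a few rows in an in-memory SQLite database. This is a sketch only: `sqlite3` stands in for BlazingSQL, the three rows are made up for illustration and do not come from nf-chunk2.csv, and the `ORDER BY` is added just to make the output order stable.

```python
import sqlite3

# Hypothetical sample rows, not taken from nf-chunk2.csv:
# (src ip, dest ip, dest port, bytes out, bytes in, duration, date)
rows = [
    ("172.30.2.60", "10.0.0.9", 80, 100, 900, 2, "2013-04-03 12:00:00"),
    ("172.30.2.60", "10.0.0.9", 443, 50, 400, 1, "2013-04-03 12:12:37"),
    ("172.10.1.89", "10.0.0.5", 80, 70, 300, 3, "2013-04-03 15:17:39"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE netflow (
        firstSeenSrcIp TEXT, firstSeenDestIp TEXT, firstSeenDestPort INTEGER,
        firstSeenSrcTotalBytes INTEGER, firstSeenDestTotalBytes INTEGER,
        durationSeconds INTEGER, parsedDate TEXT)
""")
conn.executemany("INSERT INTO netflow VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Each (source, destination) pair collapses into one edge with summary statistics.
edges = conn.execute("""
    SELECT a.firstSeenSrcIp AS source,
           a.firstSeenDestIp AS destination,
           COUNT(a.firstSeenDestPort) AS targetPorts,
           SUM(a.firstSeenSrcTotalBytes) AS bytesOut,
           SUM(a.firstSeenDestTotalBytes) AS bytesIn,
           SUM(a.durationSeconds) AS durationSeconds,
           MIN(a.parsedDate) AS firstFlowDate,
           MAX(a.parsedDate) AS lastFlowDate,
           COUNT(*) AS attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source
""").fetchall()

for edge in edges:
    print(edge)
```

The two flows between 172.30.2.60 and 10.0.0.9 collapse into a single edge whose byte counts and durations are summed, which is the same condensing step the BlazingSQL query applies to the full dataset.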
