# BlazingSQL vs. Apache Spark 

Below, we run one of our popular workloads first with [BlazingSQL](https://blazingsql.com/) and then with Apache Spark (via PySpark).

In this notebook, we will cover: 
- How to read and query CSV files with BlazingSQL.
- How BlazingSQL compares against Apache Spark (analyzing over 20M records).

## Import packages and create Blazing Context
The BlazingContext works much like a Spark Context: it stores information such as the file systems you have registered and the tables you have created.

In [1]:
from blazingsql import BlazingContext
# start up BlazingSQL
bc = BlazingContext()

BlazingContext ready


### Load & Query Table
First, we need to download the netflow data (21,526,138 records) from AWS. If you do not wish to download the full 2.5G file, the first 100,000 rows are pre-downloaded at `data/small-chunk2.csv`; skip the cell below and change the file path when prompted two cells from now.

In [2]:
# save nf-chunk2 to data folder, may take a few minutes to download
!wget -P data/ https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv 

--2020-01-20 22:14:17--  https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.112.139
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.112.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2725056295 (2.5G) [text/csv]
Saving to: ‘data/nf-chunk2.csv’


2020-01-20 22:15:06 (53.2 MB/s) - ‘data/nf-chunk2.csv’ saved [2725056295/2725056295]



## BlazingSQL 
With the data in hand, we can test the performance of BlazingSQL on this dataset.

To use pre-downloaded data, change the file path to `data/small-chunk2.csv`.
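
If you would rather not edit the path by hand, the fallback can be automated. The sketch below is an assumption and not part of the original notebook: `pick_netflow_path` is a hypothetical helper that returns the full file when it has been downloaded and the 100,000-row sample otherwise.

```python
import os

def pick_netflow_path(data_dir):
    """Return the full netflow CSV if it exists, else the pre-downloaded sample.

    Hypothetical helper for illustration; the file names come from this notebook.
    """
    full = os.path.join(data_dir, 'nf-chunk2.csv')
    sample = os.path.join(data_dir, 'small-chunk2.csv')
    return full if os.path.exists(full) else sample

# resolve the CSV path relative to the current working directory
path = pick_netflow_path(os.path.join(os.getcwd(), 'data'))
```

The next cell sets `path` explicitly, so use one approach or the other.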

In [3]:
import os
# determine current working directory 
cwd = os.getcwd()
# complete path to data
path = cwd + '/data/nf-chunk2.csv'
# what's the path?
path

'/home/winston/bsql-demos/data/nf-chunk2.csv'

In [4]:
%%time
# create a BlazingSQL table from the CSV file at `path`
bc.create_table('netflow', path, header=0)

CPU times: user 9.9 ms, sys: 13.1 ms, total: 23 ms
Wall time: 1.14 s


<pyblazing.apiv2.context.BlazingTable at 0x7f3e181d1bd0>

In [5]:
%%time
# define the query
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM 
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
            '''

# query the table (returns cuDF DataFrame)
gdf = bc.sql(query)

CPU times: user 5.07 s, sys: 2.61 s, total: 7.67 s
Wall time: 10.4 s


In [6]:
# how's it look?
gdf.head(10)

| | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 172.30.2.60 | 10.0.0.9 | 82 | 34839 | 47716 | 134 | 2013-04-03 06:48:47 | 2013-04-03 12:12:37 | 82 |
| 1 | 172.10.1.162 | 10.0.0.11 | 87 | 39628 | 53983 | 24 | 2013-04-03 06:50:13 | 2013-04-03 14:58:35 | 87 |
| 2 | 10.1.0.76 | 172.10.1.82 | 1 | 633 | 392 | 0 | 2013-04-03 09:55:05 | 2013-04-03 09:55:05 | 1 |
| 3 | 172.30.1.56 | 172.0.0.1 | 25 | 3330 | 3240 | 67 | 2013-04-03 01:59:09 | 2013-04-03 22:05:39 | 25 |
| 4 | 172.30.1.10 | 10.0.0.12 | 69 | 31042 | 43044 | 25 | 2013-04-03 06:48:01 | 2013-04-03 12:11:40 | 69 |
| 5 | 172.10.1.89 | 10.0.0.5 | 112 | 51222 | 70260 | 24 | 2013-04-03 06:48:24 | 2013-04-03 15:17:39 | 112 |
| 6 | 172.10.1.234 | 10.0.0.5 | 104 | 47287 | 64750 | 18 | 2013-04-03 06:53:55 | 2013-04-03 15:11:07 | 104 |
| 7 | 172.30.2.125 | 10.0.0.9 | 69 | 30701 | 41558 | 341 | 2013-04-03 06:50:50 | 2013-04-03 12:12:37 | 69 |
| 8 | 172.30.1.85 | 10.0.0.8 | 84 | 37828 | 52864 | 3 | 2013-04-03 06:48:21 | 2013-04-03 12:06:53 | 84 |
| 9 | 10.0.0.9 | 172.30.1.124 | 1 | 632 | 391 | 0 | 2013-04-03 10:36:04 | 2013-04-03 10:36:04 | 1 |
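
To see what the query computes without a GPU or the full dataset, the same aggregation can be run on a few rows with Python's built-in `sqlite3` module. This is only a sketch: the three rows below are invented for illustration (they are not taken from `nf-chunk2.csv`), the column types are assumed, and an `ORDER BY` is added so the result order is deterministic.

```python
import sqlite3

# Hypothetical sample rows, invented for illustration. Column order:
# firstSeenSrcIp, firstSeenDestIp, firstSeenDestPort, firstSeenSrcTotalBytes,
# firstSeenDestTotalBytes, durationSeconds, parsedDate
sample_rows = [
    ('10.0.0.1', '10.0.0.2', 80, 100, 200, 1, '2013-04-03 06:00:00'),
    ('10.0.0.1', '10.0.0.2', 443, 300, 400, 2, '2013-04-03 07:00:00'),
    ('10.0.0.3', '10.0.0.2', 22, 50, 60, 0, '2013-04-03 08:00:00'),
]

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE netflow (
        firstSeenSrcIp TEXT, firstSeenDestIp TEXT, firstSeenDestPort INTEGER,
        firstSeenSrcTotalBytes INTEGER, firstSeenDestTotalBytes INTEGER,
        durationSeconds INTEGER, parsedDate TEXT)
''')
conn.executemany('INSERT INTO netflow VALUES (?, ?, ?, ?, ?, ?, ?)', sample_rows)

# same aggregation as the notebook query, plus ORDER BY for a stable order
sample_query = '''
    SELECT
        a.firstSeenSrcIp as source,
        a.firstSeenDestIp as destination,
        count(a.firstSeenDestPort) as targetPorts,
        SUM(a.firstSeenSrcTotalBytes) as bytesOut,
        SUM(a.firstSeenDestTotalBytes) as bytesIn,
        SUM(a.durationSeconds) as durationSeconds,
        MIN(parsedDate) as firstFlowDate,
        MAX(parsedDate) as lastFlowDate,
        COUNT(*) as attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source, destination
'''
edges = conn.execute(sample_query).fetchall()
for edge in edges:
    print(edge)
```

Each output row is one (source, destination) pair with its flows rolled up. Note that `count(a.firstSeenDestPort)` counts non-NULL port values rather than distinct ports, which is consistent with `targetPorts` equalling `attemptCount` in every row shown above; `COUNT(DISTINCT ...)` would be the form to use if unique ports were intended.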


## Apache Spark
The cell below installs Apache Spark ([PySpark](https://spark.apache.org/docs/latest/api/python/index.html)).

In [7]:
# install PySpark (version 2.4.4 as of Jan 2020)
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=14abaa33edbf681f432ee00d234718731961da639e5eec86c4784667d43b4f5d
  Stored in directory: /home/winston/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installe

#### PyBlazing vs PySpark
With everything installed we can launch a SparkSession and see how BlazingSQL stacks up.

In [1]:
%%time
# this cell is adapted from a Google Colab notebook by Luca Canali: https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb

from pyspark.sql import SparkSession

# Create the Spark session
# This example uses a local cluster; change master to use YARN or K8S if available
# This example downloads sparkMeasure 0.13 for Scala 2.11 from Maven Central

spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("PySpark Netflow Benchmark code") \
        .config("spark.jars.packages","ch.cern.sparkmeasure:spark-measure_2.11:0.13")  \
        .getOrCreate()

CPU times: user 321 ms, sys: 208 ms, total: 529 ms
Wall time: 3.65 s


### Load & Query Table

In [2]:
%%time
# load CSV into Spark
netflow_df = spark.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('data/nf-chunk2.csv')

CPU times: user 20.2 ms, sys: 11.3 ms, total: 31.5 ms
Wall time: 2min 46s


In [3]:
%%time
# create table for querying
netflow_df.createOrReplaceTempView('netflow')

CPU times: user 1.72 ms, sys: 176 µs, total: 1.9 ms
Wall time: 157 ms
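
To put the timings so far side by side, the wall times reported by `%%time` can be converted to seconds and compared. The sketch below is illustrative only: `wall_seconds` is a hypothetical helper, the figures are copied from the cell outputs above, and the Spark total covers loading and registering the table only, since the Spark query timing is not shown in this section. The BlazingSQL total covers `create_table` plus the query.

```python
import re

def wall_seconds(text):
    """Convert a %%time wall-time string such as '2min 46s' or '157 ms' to seconds.

    Hypothetical helper; handles only the min, s and ms units used in this notebook.
    """
    units = {'min': 60.0, 's': 1.0, 'ms': 1e-3}
    total = 0.0
    # 'min' and 'ms' are listed before 's' so the longer unit names match first
    for value, unit in re.findall(r'([\d.]+)\s*(min|ms|s)', text):
        total += float(value) * units[unit]
    return total

# wall times copied from the cell outputs in this notebook
blazing_total = wall_seconds('1.14 s') + wall_seconds('10.4 s')    # create_table + query
spark_total = wall_seconds('2min 46s') + wall_seconds('157 ms')    # CSV load + temp view only

ratio = spark_total / blazing_total
print(f'BlazingSQL: {blazing_total:.2f} s, Spark (load only): {spark_total:.2f} s, ratio: {ratio:.1f}x')
```

On these figures, Spark spends roughly 14 times as long loading the CSV (with schema inference enabled) as BlazingSQL spends creating the table and running the query. Treat this as a rough indication from a single run rather than a controlled benchmark.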


In [4]:
%%time
# define the same query that was run with BlazingSQL above
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
            '''

# query with Spark
edges_df = spark.sql(query)

# set/display results
edges_df.show(10)

+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
|      source|    destination|targetPorts|bytesOut|bytesIn|durationSeconds|      firstFlowDate|       lastFlowDate|attemptCount|
+------------+---------------+-----------+--------+-------+---------------+-------------------+-------------------+------------+
| 172.10.1.13|239.255.255.250|         15|    2975|      0|              6|2013-04-03 06:36:19|2013-04-03 06:36:27|          15|
|172.30.1.204|239.255.255.250|          8|    1750|      0|              6|2013-04-03 06:36:13|2013-04-03 06:36:20|           8|
| 172.30.2.86|      172.0.0.1|          1|     540|      0|              2|2013-04-03 06:36:09|2013-04-03 06:36:09|           1|
|172.30.1.246|      172.0.0.1|         29|    2610|   2610|              0|2013-04-03 00:26:46|2013-04-03 23:06:00|          29|
| 172.30.1.51|239.255.255.250|         16|    3850|      0|             18|2013-04-03 06:35:22|20