# Graphistry Netflow Demo

In this example we query 70M+ rows of netflow (network traffic flow) data with BlazingSQL to search for anomalous activity within a network, then pass the results to Graphistry for visualization.

## Download the Data

The cell below downloads the data for this demo (4 Parquet files totaling 73,397,810 rows) from AWS and stores it locally in the `data` directory. If you do not wish to download the full files, the first 100,000 rows are pre-downloaded at `data/small-chunk2.csv`; simply skip the cell below and change the file path when prompted in the `Create` section below.

In [20]:
!wget -P data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_0_0.parquet
!wget -P data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_1_0.parquet 
!wget -P data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_2_0.parquet
!wget -P data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_3_0.parquet    

--2019-11-15 23:51:10--  https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_0_0.parquet
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.145.83
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.145.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112396980 (107M) [application/x-www-form-urlencoded]
Saving to: ‘data/1_0_0.parquet’


2019-11-15 23:51:12 (62.7 MB/s) - ‘data/1_0_0.parquet’ saved [112396980/112396980]

--2019-11-15 23:51:12--  https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_1_0.parquet
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.145.83
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.145.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 141547600 (135M) [application/x-www-form-urlencoded]
Saving to: ‘data/1_1_0.parquet’


2019-11-15

## Blazing Context
Here we import BlazingContext. You can think of the BlazingContext much like a Spark Context: it is where information such as the FileSystems you have registered and the Tables you have created is stored. If you have issues running this cell, restart the runtime and try running it again.

In [12]:
from blazingsql import BlazingContext 

bc = BlazingContext()

Already connected to the Orchestrator
BlazingContext ready


### Create & Query Tables
In this next cell we identify the full path to the data we've downloaded.

In [13]:
# identify working directory path
local_path = !pwd

# make wildcard path to load all 4 parquet files into blazingsql
path = local_path[0] + '/data/*_0.parquet'

# what's the path? 
path

'/home/winston/bsql-demos/data/*_0.parquet'
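
The `!pwd` shell escape only works inside IPython/Jupyter. As a minimal sketch, assuming the same `data/` layout as above, the same wildcard path can be built in plain Python, and the standard-library `glob` module can show which files the wildcard matches before the path is handed to BlazingSQL. The helper names `wildcard_path` and `matched_files` are hypothetical and not part of BlazingSQL.

```python
import glob
import os


def wildcard_path(base_dir):
    """Return the wildcard path for the demo's Parquet files under base_dir/data."""
    return os.path.join(base_dir, 'data', '*_0.parquet')


def matched_files(pattern):
    """List the files a wildcard pattern matches, sorted by name."""
    return sorted(glob.glob(pattern))


# equivalent to the `path` variable built with !pwd above
path = wildcard_path(os.getcwd())
print(path)

# an empty list here means the download step has not been run yet
print(matched_files(path))
```

If the list printed at the end is empty, the download cell has probably not been run, or the notebook was started from a different working directory.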

#### Create
Here we use the path identified above to load all 4 Parquet files into a single BlazingSQL table. This is done by using a wildcard (`*`) in the file path.

Note: to use the pre-downloaded sample instead, point `path` to `data/small-chunk2.csv`.
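
One way to avoid editing the path by hand is to fall back to the sample file automatically. The following is only a sketch under the assumption that both the Parquet files and the sample CSV live in the same `data` directory; `choose_netflow_path` is a hypothetical helper, not a BlazingSQL function.

```python
import glob
import os

SAMPLE_CSV = 'small-chunk2.csv'   # pre-downloaded 100,000-row sample
PARQUET_GLOB = '*_0.parquet'      # wildcard for the downloaded Parquet files


def choose_netflow_path(data_dir):
    """Prefer the full Parquet files; fall back to the sample CSV if none exist."""
    pattern = os.path.join(data_dir, PARQUET_GLOB)
    if glob.glob(pattern):
        # at least one Parquet file is present, so keep the wildcard path
        return pattern
    return os.path.join(data_dir, SAMPLE_CSV)


path = choose_netflow_path(os.path.join(os.getcwd(), 'data'))
print(path)
```

The returned value can then be used as `path` in the `create_table` cell.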

In [31]:
%%time
# create a blazingsql table from the parquet files matched by path
bc.create_table('netflow', path)

CPU times: user 4.16 ms, sys: 4.18 ms, total: 8.35 ms
Wall time: 298 ms


<pyblazing.apiv2.sql.Table at 0x7f7189dc4ac8>

#### Query
With the table made, we can simply run a SQL query.

We are going to run some aggregations in order to condense these millions of rows into thousands of rows that represent nodes and edges.

In [32]:
%%time
# what are we looking for 
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            COUNT(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# run sql query
result = bc.sql(query).get()

# extract cudf dataframe from query result
gdf = result.columns

CPU times: user 29.3 ms, sys: 41.9 ms, total: 71.3 ms
Wall time: 4.51 s
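
To see what this aggregation does without a GPU, the sketch below runs the same query shape against a tiny in-memory SQLite table. The four rows are invented purely for illustration, and SQLite stands in for BlazingSQL here, so dialect details may differ. The extra `distinctPorts` column is an addition to the original query: `COUNT(column)` counts non-null rows rather than distinct values, so `targetPorts` equals `attemptCount` whenever every flow has a destination port.

```python
import sqlite3

# toy rows invented for illustration:
# (src ip, dest ip, dest port, bytes out, bytes in, duration, date)
rows = [
    ('10.0.0.1', '10.0.0.9', 80, 100, 10, 3, '2013-04-01 08:00:00'),
    ('10.0.0.1', '10.0.0.9', 80, 200, 20, 4, '2013-04-02 09:00:00'),
    ('10.0.0.1', '10.0.0.9', 443, 300, 30, 5, '2013-04-03 10:00:00'),
    ('10.0.0.2', '10.0.0.9', 22, 50, 5, 1, '2013-04-01 11:00:00'),
]

con = sqlite3.connect(':memory:')
con.execute('''CREATE TABLE netflow (
    firstSeenSrcIp TEXT, firstSeenDestIp TEXT, firstSeenDestPort INTEGER,
    firstSeenSrcTotalBytes INTEGER, firstSeenDestTotalBytes INTEGER,
    durationSeconds INTEGER, parsedDate TEXT)''')
con.executemany('INSERT INTO netflow VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

query = '''
    SELECT
        a.firstSeenSrcIp AS source,
        a.firstSeenDestIp AS destination,
        COUNT(a.firstSeenDestPort) AS targetPorts,
        COUNT(DISTINCT a.firstSeenDestPort) AS distinctPorts,
        SUM(a.firstSeenSrcTotalBytes) AS bytesOut,
        SUM(a.firstSeenDestTotalBytes) AS bytesIn,
        SUM(a.durationSeconds) AS durationSeconds,
        MIN(parsedDate) AS firstFlowDate,
        MAX(parsedDate) AS lastFlowDate,
        COUNT(*) AS attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source, destination
'''

# one output row per (source, destination) pair
edges = con.execute(query).fetchall()
for edge in edges:
    print(edge)
```

With these toy rows the first pair has three flows to two distinct ports, so `targetPorts` is 3 while `distinctPorts` is 2.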


In [33]:
# how do the results look?
gdf.head(25)

| | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.0.0.13 | 60805 | 5 | 3165 | 25 | 856 | 2013-04-04 12:37:43 | 2013-04-06 14:21:53 | 5 |
| 1 | 10.0.0.7 | 58945 | 8 | 5056 | 40 | 1352 | 2013-04-01 10:50:05 | 2013-04-06 08:59:53 | 8 |
| 2 | 10.0.0.6 | 64531 | 9 | 5688 | 45 | 1503 | 2013-04-01 14:34:49 | 2013-04-06 12:32:14 | 9 |
| 3 | 10.1.0.76 | 1588 | 16 | 10128 | 82 | 3168 | 2013-04-01 08:42:09 | 2013-04-07 08:24:19 | 16 |
| 4 | 10.0.0.13 | 48255 | 8 | 5064 | 40 | 1408 | 2013-04-02 13:37:22 | 2013-04-07 08:30:07 | 8 |
| 5 | 10.0.0.10 | 62076 | 1 | 633 | 5 | 152 | 2013-04-03 11:16:20 | 2013-04-03 11:16:20 | 1 |
| 6 | 10.7.5.5 | 18256 | 5 | 22118307 | 9128 | 912 | 2013-04-03 11:03:52 | 2013-04-05 12:24:54 | 5 |
| 7 | 10.0.0.5 | 54363 | 12 | 7584 | 60 | 1956 | 2013-04-01 11:14:43 | 2013-04-05 13:16:40 | 12 |
| 8 | 172.20.0.4 | 36513 | 60 | 552030 | 495 | 2925 | 2013-04-03 08:26:33 | 2013-04-03 11:33:10 | 60 |
| 9 | 10.1.0.100 | 60962 | 3 | 1902 | 15 | 531 | 2013-04-02 08:22:00 | 2013-04-03 13:17:20 | 3 |
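
Each row of this result can be read as one edge of a graph, with the values in `source` and `destination` as its endpoints. As a rough sketch of how a node list could be derived from such edge rows, the plain-Python example below uses three invented rows reduced to `(source, destination, bytesOut)`; the `nodes_from_edges` helper and the per-node fields are hypothetical and not part of BlazingSQL or Graphistry.

```python
from collections import defaultdict

# hypothetical edge rows in the shape (source, destination, bytesOut)
edges = [
    ('10.0.0.1', '10.0.0.9', 600),
    ('10.0.0.2', '10.0.0.9', 50),
    ('10.0.0.1', '10.0.0.3', 75),
]


def nodes_from_edges(edge_rows):
    """Return {endpoint: {'out_degree', 'in_degree', 'bytes_out'}} for every endpoint."""
    nodes = defaultdict(lambda: {'out_degree': 0, 'in_degree': 0, 'bytes_out': 0})
    for source, destination, bytes_out in edge_rows:
        nodes[source]['out_degree'] += 1
        nodes[source]['bytes_out'] += bytes_out
        nodes[destination]['in_degree'] += 1
    return dict(nodes)


nodes = nodes_from_edges(edges)
for endpoint in sorted(nodes):
    print(endpoint, nodes[endpoint])
```

Endpoints with an unusually high degree or byte count are natural starting points when looking for anomalous activity.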
