# Graphistry Netflow Demo

In this example we take millions of rows of netflow (network traffic flow) data and search them for anomalous activity within a network. We query 20M rows of netflow data with BlazingSQL and pass the result to Graphistry for visualization.

## Download CSV

The cell below downloads the data for this demo (about 2.5 GB) from AWS S3 and stores it locally as `nf-chunk2.csv`.

In [1]:
!wget https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv

--2019-11-15 05:53:37--  https://blazingsql-colab.s3.amazonaws.com/netflow_data/nf-chunk2.csv
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.217.41.36
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.217.41.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2725056295 (2.5G) [text/csv]
Saving to: ‘nf-chunk2.csv’


2019-11-15 05:54:46 (38.1 MB/s) - ‘nf-chunk2.csv’ saved [2725056295/2725056295]
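Because the file is large, it can be worth confirming that the download completed before loading it. The following is a minimal sketch, assuming the byte count reported by `wget` above is the expected size; the helper name `download_is_complete` is hypothetical and not part of the original notebook.

```python
import os

# Size in bytes, taken from the "Length" line of the wget output above.
EXPECTED_BYTES = 2725056295


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if `path` is a file of exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Prints False if the file is missing or was only partially downloaded.
print(download_is_complete("nf-chunk2.csv"))
```

A size check only catches truncated downloads; it does not verify the file contents.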



## Blazing Context
Here we import cuDF and BlazingContext. The BlazingContext is much like a Spark Context: it stores information such as the FileSystems you have registered and the Tables you have created. If you have issues running this cell, restart the runtime and run it again.

In [2]:
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

BlazingContext ready


### Load & Query Tables

In the cells below, we first load the CSV file into a GPU DataFrame (GDF), and then create a table from it so that we can run SQL queries on that GDF.

Note: creating a table from a GDF does not copy the data; it merely registers the schema.

In [3]:
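%time
# NOTE: a bare `%time` line times an empty statement, so the timing printed
# below does not measure read_csv; put `%%time` on the first line of the cell
# to time the whole cell.
# cudf gpu dataframe from csv
gdf = cudf.read_csv('nf-chunk2.csv')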

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.39 µs


In [4]:
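%time
# NOTE: a bare `%time` line times an empty statement, not create_table;
# put `%%time` on the first line of the cell to time the whole cell.
# blazingsql table from gpu dataframe
bc.create_table('netflow', gdf)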

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.68 µs


<pyblazing.apiv2.sql.Table at 0x7f0098759fd0>

#### Query
With the table made, we can simply run a SQL query.

We are going to run aggregations to condense these millions of rows into thousands of rows that represent nodes and edges.

In [5]:
%%time
# aggregate flows into one row per (source, destination) pair
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            COUNT(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# run sql query
result = bc.sql(query).get()

# extract cudf dataframe from query result
gdf = result.columns

CPU times: user 28.9 ms, sys: 4.25 ms, total: 33.1 ms
Wall time: 1.84 s
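The grouping in the query above can be tried on a handful of rows without a GPU. The sketch below uses Python's built-in `sqlite3` as a stand-in for BlazingSQL, with made-up sample flows (not taken from `nf-chunk2.csv`) and only a subset of the columns. The `distinctPorts` column is an addition for illustration and is not in the original query. It shows that `COUNT(column)` counts non-null rows rather than distinct values, so `targetPorts` equals `attemptCount` whenever the port is never null.

```python
import sqlite3

# Hypothetical sample flows for illustration only (not rows from nf-chunk2.csv):
# (firstSeenSrcIp, firstSeenDestIp, firstSeenDestPort,
#  firstSeenSrcTotalBytes, firstSeenDestTotalBytes)
rows = [
    ("172.10.1.1", "10.0.0.5", 80, 100, 200),
    ("172.10.1.1", "10.0.0.5", 80, 150, 250),
    ("172.10.1.1", "10.0.0.5", 443, 50, 75),
    ("172.10.1.2", "10.0.0.5", 22, 10, 20),
]

# SQLite in-memory database standing in for the BlazingSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE netflow (
        firstSeenSrcIp TEXT,
        firstSeenDestIp TEXT,
        firstSeenDestPort INTEGER,
        firstSeenSrcTotalBytes INTEGER,
        firstSeenDestTotalBytes INTEGER
    )
    """
)
conn.executemany("INSERT INTO netflow VALUES (?, ?, ?, ?, ?)", rows)

# Same GROUP BY shape as the notebook query, plus distinctPorts for comparison.
edges = conn.execute(
    """
    SELECT
        a.firstSeenSrcIp AS source,
        a.firstSeenDestIp AS destination,
        COUNT(a.firstSeenDestPort) AS targetPorts,
        COUNT(DISTINCT a.firstSeenDestPort) AS distinctPorts,
        SUM(a.firstSeenSrcTotalBytes) AS bytesOut,
        SUM(a.firstSeenDestTotalBytes) AS bytesIn,
        COUNT(*) AS attemptCount
    FROM netflow a
    GROUP BY a.firstSeenSrcIp, a.firstSeenDestIp
    ORDER BY source, destination
    """
).fetchall()

for edge in edges:
    print(edge)
```

Four flows collapse into two edges, one per (source, destination) pair, which is the same reduction the notebook query applies to the full table.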


In [6]:
# how do the results look?
gdf.head(10)

| | source | destination | targetPorts | bytesOut | bytesIn | durationSeconds | firstFlowDate | lastFlowDate | attemptCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 172.10.1.234 | 10.0.0.5 | 104 | 47287 | 64750 | 18 | 2013-04-03 06:53:55 | 2013-04-03 15:11:07 | 104 |
| 1 | 172.30.1.85 | 10.0.0.8 | 84 | 37828 | 52864 | 3 | 2013-04-03 06:48:21 | 2013-04-03 12:06:53 | 84 |
| 2 | 172.30.1.10 | 10.0.0.12 | 69 | 31042 | 43044 | 25 | 2013-04-03 06:48:01 | 2013-04-03 12:11:40 | 69 |
| 3 | 172.10.1.89 | 10.0.0.5 | 112 | 51222 | 70260 | 24 | 2013-04-03 06:48:24 | 2013-04-03 15:17:39 | 112 |
| 4 | 172.30.2.60 | 10.0.0.9 | 82 | 34839 | 47716 | 134 | 2013-04-03 06:48:47 | 2013-04-03 12:12:37 | 82 |
| 5 | 172.10.1.162 | 10.0.0.11 | 87 | 39628 | 53983 | 24 | 2013-04-03 06:50:13 | 2013-04-03 14:58:35 | 87 |
| 6 | 172.30.1.56 | 172.0.0.1 | 25 | 3330 | 3240 | 67 | 2013-04-03 01:59:09 | 2013-04-03 22:05:39 | 25 |
| 7 | 172.20.1.58 | 10.7.5.5 | 49 | 3041309 | 116400561 | 2091 | 2013-04-03 10:12:27 | 2013-04-03 11:20:09 | 49 |
| 8 | 172.30.2.125 | 10.0.0.9 | 69 | 30701 | 41558 | 341 | 2013-04-03 06:50:50 | 2013-04-03 12:12:37 | 69 |
| 9 | 172.10.1.106 | 10.199.250.2 | 40 | 66638 | 2863884 | 24 | 2013-04-03 07:19:02 | 2013-04-03 10:12:35 | 40 |
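Each row of this result is one edge, and the nodes are the distinct IP addresses that appear as either endpoint. As a rough illustration of that relationship, the sketch below derives a node list and per-node edge counts in plain Python from the first five `source`/`destination` pairs shown above. It is not the Graphistry API and is not how the notebook builds its graph.

```python
from collections import Counter

# (source, destination) pairs copied from the first five rows of the output above.
edges = [
    ("172.10.1.234", "10.0.0.5"),
    ("172.30.1.85", "10.0.0.8"),
    ("172.30.1.10", "10.0.0.12"),
    ("172.10.1.89", "10.0.0.5"),
    ("172.30.2.60", "10.0.0.9"),
]

# Every distinct IP that appears as an endpoint is a node.
nodes = sorted({ip for edge in edges for ip in edge})

# Number of edges touching each node (its degree within this small sample).
degree = Counter(ip for edge in edges for ip in edge)

print(len(nodes), "nodes,", len(edges), "edges")
print(degree.most_common(1))
```

In this sample `10.0.0.5` is the only address shared by two edges.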
