# Graphistry Netflow Demo

In this example we take millions of rows of netflow (network traffic flow) data and search for anomalous activity within a network. We query 70M+ rows of network security data with BlazingSQL and pass the results to Graphistry for visualization.

## Blazing Context
Here we import BlazingContext. You can think of the BlazingContext much like a Spark Context: it is where information such as the FileSystems you have registered and the Tables you have created is stored. If you have issues running this cell, restart the runtime and run it again.

In [12]:
from blazingsql import BlazingContext 

bc = BlazingContext()

Already connected to the Orchestrator
BlazingContext ready


### Create & Query Tables
In this next cell we identify the full path to the data.

In [13]:
# identify working directory path (`!pwd` returns a list of output lines)
local_path = !pwd

# make wildcard path to load all 4 parquet files into blazingsql
path = local_path[0] + '/data/*_0.parquet'

# what's the path? 
path

'/home/winston/bsql-demos/data/*_0.parquet'
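
To see which files a pattern like `*_0.parquet` would pick up, the sketch below uses Python's standard `glob` module on a throwaway directory. This assumes the wildcard behaves like ordinary shell-style globbing, which this notebook does not confirm for BlazingSQL. The file names are hypothetical and exist only for illustration.

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a temporary directory for illustration only
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, 'data')
    os.makedirs(data_dir)
    for name in ['a_0.parquet', 'b_0.parquet', 'b_1.parquet', 'notes.txt']:
        open(os.path.join(data_dir, name), 'w').close()

    # same shape of pattern as the notebook's path: <dir>/data/*_0.parquet
    pattern = os.path.join(data_dir, '*_0.parquet')
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

# only names ending in _0.parquet are matched
print(matched)
```

Running the same kind of check against your own `data/` directory is a quick way to confirm that the pattern matches the files you expect before creating the table.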

#### Create
Here we use the path identified above to load all 4 parquet files into a single BlazingSQL table. The wildcard (`*`) in the file path matches all of the files at once.

Note: to use the pre-downloaded data instead, point `path` to `data/small-chunk2.csv`.

In [31]:
%%time
# create a blazingsql table from the parquet files matched by path
bc.create_table('netflow', path)

CPU times: user 4.16 ms, sys: 4.18 ms, total: 8.35 ms
Wall time: 298 ms


<pyblazing.apiv2.sql.Table at 0x7f7189dc4ac8>

#### Query
With the table made, we can simply run a SQL query.

We are going to run aggregations to condense these millions of rows into thousands of rows that represent nodes and edges: each distinct (source IP, destination IP) pair becomes one edge, with its flows summed into totals.
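
The idea of collapsing many flow rows into a few edges can be sketched at toy scale. The example below uses Python's built-in `sqlite3` as a stand-in for BlazingSQL, with a hypothetical `flows` table and invented rows. It is a minimal illustration of grouping by source and destination, not the notebook's actual query or data.

```python
import sqlite3

# Toy flow records (src, dst, dst_port, bytes_out); values are invented for illustration
flows = [
    ('10.0.0.1', '10.0.0.2', 80, 100),
    ('10.0.0.1', '10.0.0.2', 443, 250),
    ('10.0.0.1', '10.0.0.3', 22, 40),
    ('10.0.0.2', '10.0.0.3', 22, 60),
    ('10.0.0.2', '10.0.0.3', 22, 10),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE flows (src TEXT, dst TEXT, dst_port INTEGER, bytes_out INTEGER)')
conn.executemany('INSERT INTO flows VALUES (?, ?, ?, ?)', flows)

# one output row per (src, dst) pair: this is the edge list
edges = conn.execute('''
    SELECT src, dst, COUNT(*) AS attempts, SUM(bytes_out) AS bytes_out
    FROM flows
    GROUP BY src, dst
    ORDER BY src, dst
''').fetchall()
conn.close()

# the node list is every IP that appears at either end of an edge
nodes = sorted({ip for src, dst, *_ in edges for ip in (src, dst)})

print(edges)
print(nodes)
```

Five flow rows reduce to three edges and three nodes here. The same reduction is what makes tens of millions of rows small enough to visualize as a graph.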

In [32]:
%%time
# what are we looking for 
query = '''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            COUNT(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        '''

# run sql query (returns cuDF DataFrame)
gdf = bc.sql(query)

# how do the results look?
gdf.head(25)

CPU times: user 29.3 ms, sys: 41.9 ms, total: 71.3 ms
Wall time: 4.51 s
