## Tutorial 1: Network Mapping & Beacon Analysis from Network Logs

Run in a free GPU environment with BlazingSQL preinstalled at ZZZ. For the GPU tutorials without BlazingSQL, we recommend Google Colab.

This notebook shows how to:

1. **Load 53 GB csv / 8 GB parquet (500M rows) of netflow in seconds**
  * Sample data: [UGR'16 - Spanish ISP netflow](https://nesg.ugr.es/nesg-ugr16) nfdump
  * The same technique works with pcap, netflow, firewall logs, etc.
  * Computations can exceed single-GPU memory: BlazingSQL automatically pages data in/out and can use multiple GPUs
2. **Compute a graph of IP<>IP activity**
  * Optionally split by Port
  * Regular tabular stats: top talkers, ...
  * Compute graph stats: partition/size, centrality, ...
  * Hunt: Beaconing!
3. **Visualize: GPU graph and traditional tables/CSVs**

In [1]:
# ### If empty, get data in Setup cells
# ! ls -lh csv_lan_logs/march_week5_00.csv && ls -lh lan_logs/march_week5_00*

ls: cannot access 'csv_lan_logs/march_week5_00.csv': No such file or directory


In [5]:
### If no GPU, switch servers
! nvidia-smi

Tue May 12 17:29:46 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    20W /  N/A |    408MiB /  5944MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

## Setup

* If not already installed, uncomment & run the two install commands below, restart your Jupyter kernel, then comment them out again
* For alternative BlazingSQL install: https://docs.blazingdb.com/docs/install-via-conda

In [2]:
# ! conda install -c blazingsql-nightly/label/cuda10.0 -c blazingsql-nightly -c rapidsai-nightly -c conda-forge -c defaults blazingsql python=3.7

In [3]:
#! pip install --user -q graphistry

In [4]:
#! echo '{"key": "zzz"}' > graphistry.json

In [1]:
import json, graphistry, pandas as pd
from blazingsql import BlazingContext
pd.options.display.max_rows = 1000

bc = BlazingContext()

GRAPHISTRY_KEY="ZZZ"
with open('graphistry.json') as f:
    GRAPHISTRY_KEY = json.load(f)['key']
graphistry.register(server='sandbox.graphistry.com', key=GRAPHISTRY_KEY, protocol='https')

graphistry.__version__

BlazingContext ready




'0.10.6'

Register AWS S3 bucket and create tables.

In [2]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

(True,
 '',
 OrderedDict([('type', 's3'),
              ('bucket_name', 'blazingsql-colab'),
              ('access_key_id', ''),
              ('secret_key', ''),
              ('session_token', ''),
              ('encryption_type', <S3EncryptionType.NONE: 1>),
              ('kms_key_amazon_resource_name', '')]))

In [3]:
%%time
# Files are numbered 00..57; %02d zero-pads the index
files = ['s3://blazingsql-colab/parquet_lan_logs/march_week5_%02d.parquet' % i for i in range(58)]
print('# files', len(files))

bc.create_table('logs', files)

# files 58
CPU times: user 4.96 s, sys: 1.4 s, total: 6.36 s
Wall time: 30.6 s


In [4]:
%%time
bc.create_table('logs_10m', 's3://blazingsql-colab/parquet_lan_logs/march_week5_01.parquet')

CPU times: user 543 ms, sys: 304 ms, total: 847 ms
Wall time: 4.63 s


## Simple SQL

In [5]:
%%time
bc.sql('SELECT COUNT(src_ip) FROM logs_10m')

CPU times: user 8.82 s, sys: 766 ms, total: 9.58 s
Wall time: 9.01 s


| | COUNT(src_ip) |
|---|---|
| 0 | 10000000 |


In [6]:
%%time
bc.sql('SELECT COUNT(src_ip) FROM logs')

CPU times: user 9.39 s, sys: 5.19 s, total: 14.6 s
Wall time: 20.7 s


| | COUNT(src_ip) |
|---|---|
| 0 | 99999999 |


In [7]:
%%time
bc.sql('SELECT * FROM logs_10m LIMIT 1').head(1)

CPU times: user 3.4 s, sys: 2.35 s, total: 5.75 s
Wall time: 57.1 s


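| | conn_timestamp | duration | src_ip | dst_ip | src_port | dst_port | protocol | flags | tos | packets | flows | bytes | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-28 01:56:25 | 25.996 | 42.219.159.197 | 62.162.188.46 | 50084 | 161 | UDP | .A.... | 0 | 0 | 12 | 846 | background |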


In [12]:
%%time
bc.sql('SELECT * FROM logs WHERE src_port = 22 OR dst_port = 22 ORDER BY bytes DESC LIMIT 3')

CPU times: user 17.4 s, sys: 13.2 s, total: 30.6 s
Wall time: 34.3 s


| | conn_timestamp | duration | src_ip | dst_ip | src_port | dst_port | protocol | flags | tos | packets | flows | bytes | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-28 15:25:16 | 319.732 | 42.219.159.181 | 57.41.5.186 | 39446 | 22 | TCP | .AP... | 0 | 8 | 1331726 | 1930233248 | background |
| 1 | 2016-03-28 06:05:08 | 302.988 | 36.178.196.253 | 42.219.158.213 | 59412 | 22 | TCP | .AP.S. | 0 | 40 | 916439 | 1240231829 | background |
| 2 | 2016-03-29 06:05:08 | 302.912 | 36.178.196.253 | 42.219.158.213 | 44434 | 22 | TCP | .AP.S. | 0 | 40 | 768130 | 1077796909 | background |


In [13]:
%%time
len(bc.sql('SELECT src_ip FROM logs WHERE src_port = 22 OR dst_port = 22 ORDER BY bytes DESC'))

CPU times: user 6.65 s, sys: 4.66 s, total: 11.3 s
Wall time: 11.3 s


2202154

In [14]:
%%time
len(bc.sql('SELECT src_ip FROM logs WHERE src_port = 22 OR dst_port = 22 ORDER BY bytes DESC')['src_ip'].unique())

CPU times: user 6.67 s, sys: 4.61 s, total: 11.3 s
Wall time: 11.2 s


2048

## Stats on `(src_ip, dst_ip)` combos: Top talkers, ...

In [None]:
%%time
query = """
        SELECT 
            * 
        FROM
            (
            SELECT
                COUNT(*) as num_records,
                src_ip,
                dst_ip,
                
                MIN(conn_timestamp) as timestamp_earliest,
                MAX(conn_timestamp) as timestamp_latest,
                
                MIN(src_port) as src_port_num_min,
                MAX(src_port) as src_port_num_max,
                MAX(src_port) - min(src_port) as src_port_width,
                
                MIN(dst_port) as dst_port_num_min,
                MAX(dst_port) as dst_port_num_max,
                MAX(dst_port) - min(dst_port) as dst_port_width,
                
                CASE WHEN MIN(src_port) < MIN(dst_port) THEN MIN(src_port) ELSE MIN(dst_port) END as port_min,
                
                SUM(packets) as packets_sum,
                MAX(packets) as packets_max,
                MIN(packets) as packets_min,
                
                SUM(flows) as flows_sum,
                MAX(flows) as flows_max,
                MIN(flows) as flows_min,
                
                SUM(bytes) as bytes_sum,
                MAX(bytes) as bytes_max,
                MIN(bytes) as bytes_min
                
            FROM logs
            
            GROUP BY
                src_ip,
                dst_ip
            ) as summary_table
                
            WHERE summary_table.num_records > 1
        
        ORDER BY num_records DESC
        """

bc.sql(query).head(10)
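
To see the shape of this aggregation without a GPU, here is a minimal sketch that runs a trimmed-down version of the query against an in-memory SQLite table. It is an illustration only: the rows are made up (not from UGR'16), SQLite stands in for BlazingSQL, and `HAVING COUNT(*) > 1` is used as an equivalent of the outer `WHERE num_records > 1` filter.

```python
import sqlite3

# Hypothetical toy rows, not real UGR'16 data:
# (conn_timestamp, src_ip, dst_ip, src_port, dst_port, packets, bytes)
rows = [
    ("2016-03-28 01:00:00", "10.0.0.1", "10.0.0.9", 50000, 22, 4, 400),
    ("2016-03-28 01:05:00", "10.0.0.1", "10.0.0.9", 50001, 22, 6, 600),
    ("2016-03-28 01:10:00", "10.0.0.1", "10.0.0.9", 50002, 22, 2, 200),
    ("2016-03-28 02:00:00", "10.0.0.2", "10.0.0.9", 40000, 53, 1, 80),
    ("2016-03-28 02:30:00", "10.0.0.2", "10.0.0.9", 40001, 53, 1, 90),
    ("2016-03-28 03:00:00", "10.0.0.3", "10.0.0.8", 30000, 443, 9, 900),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (conn_timestamp TEXT, src_ip TEXT, dst_ip TEXT, "
    "src_port INTEGER, dst_port INTEGER, packets INTEGER, bytes INTEGER)"
)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# One output row per (src_ip, dst_ip) pair, keeping pairs seen more than once
top_talkers_sql = """
    SELECT src_ip, dst_ip,
           COUNT(*)                      AS num_records,
           MIN(conn_timestamp)           AS timestamp_earliest,
           MAX(conn_timestamp)           AS timestamp_latest,
           MAX(src_port) - MIN(src_port) AS src_port_width,
           SUM(bytes)                    AS bytes_sum
    FROM logs
    GROUP BY src_ip, dst_ip
    HAVING COUNT(*) > 1
    ORDER BY num_records DESC
"""
top_talkers = conn.execute(top_talkers_sql).fetchall()
for row in top_talkers:
    print(row)
```

The pair that appears only once (`10.0.0.3` to `10.0.0.8`) is filtered out, which is what the `num_records > 1` condition does in the full query.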

## ... And split on Port: `(src_ip, dst_ip, port)`

In [None]:
%%time
query2 = """
         SELECT 
             * 
         FROM
             (
             SELECT
                 COUNT(*) as num_records,
                 src_ip as source_ip,
                 dst_ip as destination_ip,
                 
                 MIN(conn_timestamp) as timestamp_earliest,
                 MAX(conn_timestamp) as timestamp_latest,
                 
                 MIN(src_port) as src_port_num_min,
                 MAX(src_port) as src_port_num_max,
                 MAX(src_port) - min(src_port) as src_port_width,
                 
                 MIN(dst_port) as dst_port_num_min,
                 MAX(dst_port) as dst_port_num_max,
                 MAX(dst_port) - min(dst_port) as dst_port_width,
                 
                 CASE WHEN src_port < dst_port THEN src_port ELSE dst_port END as port_min,
                 
                 SUM(packets) as packets_sum,
                 MAX(packets) as packets_max,
                 MIN(packets) as packets_min,
                 
                 SUM(flows) as flows_sum,
                 MAX(flows) as flows_max,
                 MIN(flows) as flows_min,
                 
                 SUM(bytes) as bytes_sum,
                 MAX(bytes) as bytes_max,
                 MIN(bytes) as bytes_min
                 
             FROM logs
             
             GROUP BY
                 src_ip,
                 dst_ip,
                 CASE WHEN src_port < dst_port THEN src_port ELSE dst_port END
             ) as summary_table
             
             WHERE summary_table.num_records > 1 OR summary_table.flows_sum > 10
             
         ORDER BY num_records DESC
         """

bc.sql(query2).head(3)
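
The port split groups on the lower of the two ports, presumably as a heuristic for the service port, since client-side ephemeral ports are usually high. A minimal sketch of that grouping, again using SQLite and made-up rows as stand-ins for BlazingSQL and the real logs:

```python
import sqlite3

# Hypothetical toy flows: (src_ip, dst_ip, src_port, dst_port)
rows = [
    ("10.0.0.1", "10.0.0.9", 50000, 22),
    ("10.0.0.1", "10.0.0.9", 50001, 22),
    ("10.0.0.9", "10.0.0.1", 22, 50002),   # reply direction, lower port is still 22
    ("10.0.0.1", "10.0.0.9", 50003, 443),
    ("10.0.0.1", "10.0.0.9", 50004, 443),
    ("10.0.0.1", "10.0.0.9", 50005, 53),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (src_ip TEXT, dst_ip TEXT, src_port INTEGER, dst_port INTEGER)"
)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)

# Group on (src_ip, dst_ip, lower port); the CASE expression is repeated
# in GROUP BY, mirroring the notebook query
per_port_sql = """
    SELECT src_ip, dst_ip,
           CASE WHEN src_port < dst_port THEN src_port ELSE dst_port END AS port_min,
           COUNT(*) AS num_records
    FROM logs
    GROUP BY src_ip, dst_ip,
             CASE WHEN src_port < dst_port THEN src_port ELSE dst_port END
    ORDER BY num_records DESC, port_min ASC
"""
per_port = conn.execute(per_port_sql).fetchall()
for row in per_port:
    print(row)
```

Note that the grouping is still directional: the reply flow from `10.0.0.9` to `10.0.0.1` lands in its own group even though it shares port 22 with the forward flows.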

## BlazingSQL -> cudf GPU DataFrame -> pandas CPU DataFrame

In [None]:
gdf = bc.sql(query)

pdf = gdf.to_pandas()

print('GDF len: %s, PDF len: %s' % (len(gdf), len(pdf)))

# Best practice: drop the gdf reference once it is copied to pandas so its GPU memory can be freed
gdf = None

pdf.sample(3)

#### Number of unique src IPs: via SQL vs. via DataFrame

In [None]:
%%time
bc.sql("select count(distinct src_ip) from logs")

In [None]:
%%time
len(bc.sql("select src_ip from logs")['src_ip'].unique())
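
The two cells above compute the same number in two ways: the first asks the SQL engine to deduplicate and return one value, the second pulls the whole `src_ip` column back and deduplicates it afterwards. A small sketch of the difference, using SQLite and a Python `set` as stand-ins (an assumption for illustration; the notebook itself uses BlazingSQL and cudf):

```python
import sqlite3

# Hypothetical toy data: five flows from three distinct sources
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (src_ip TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?)",
    [("10.0.0.1",), ("10.0.0.2",), ("10.0.0.1",), ("10.0.0.3",), ("10.0.0.2",)],
)

# Option 1: the engine deduplicates and returns a single number
n_sql = conn.execute("SELECT COUNT(DISTINCT src_ip) FROM logs").fetchone()[0]

# Option 2: materialize the full column, then deduplicate on the client side
column = [row[0] for row in conn.execute("SELECT src_ip FROM logs")]
n_client = len(set(column))

print(n_sql, n_client, len(column))
```

Both options agree on the count; the second has to hold every row of the column in memory first, which matters more as the table grows.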

### Visualize! Ex: SSH

In [11]:
%%time
query3 = """
         SELECT 
             * 
         FROM
             (
             SELECT
                 COUNT(*) as num_records,
                 src_ip,
                 dst_ip,
                 
                 MIN(conn_timestamp) as timestamp_earliest,
                 MAX(conn_timestamp) as timestamp_latest,
                 
                 MIN(src_port) as src_port_num_min,
                 MAX(src_port) as src_port_num_max,
                 MAX(src_port) - min(src_port) as src_port_width,
                 
                 MIN(dst_port) as dst_port_num_min,
                 MAX(dst_port) as dst_port_num_max,
                 MAX(dst_port) - min(dst_port) as dst_port_width,
                 
                 CASE WHEN MIN(src_port) < MIN(dst_port) THEN MIN(src_port) ELSE MIN(dst_port) END as port_min,
                 
                 SUM(packets) as packets_sum,
                 MAX(packets) as packets_max,
                 MIN(packets) as packets_min,
                 
                 SUM(flows) as flows_sum,
                 MAX(flows) as flows_max,
                 MIN(flows) as flows_min,
                 
                 SUM(bytes) as bytes_sum,
                 MAX(bytes) as bytes_max,
                 MIN(bytes) as bytes_min
                 
             FROM logs
                 
                 WHERE src_port = 22 OR dst_port = 22
                 
             GROUP BY
                 src_ip,
                 dst_ip
             
             ) as summary_table
                 
                 WHERE summary_table.num_records > 10
                 
             ORDER BY num_records DESC
             """

ssh_links_gdf = bc.sql(query3)

CPU times: user 12.3 s, sys: 9.36 s, total: 21.7 s
Wall time: 21.5 s


In [13]:
graphistry\
    .edges(ssh_links_gdf.to_pandas())\
    .bind(source='src_ip', destination='dst_ip').plot()
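
Before or alongside plotting, simple graph statistics can be read straight off the edge list. The sketch below counts in-degree and out-degree per IP as a very basic centrality measure. The edges are hypothetical and only shaped like the `(src_ip, dst_ip, num_records)` summary rows; a larger analysis would more likely use a graph library.

```python
from collections import Counter

# Hypothetical (src_ip, dst_ip, num_records) edges
edges = [
    ("10.0.0.1", "10.0.0.9", 40),
    ("10.0.0.2", "10.0.0.9", 25),
    ("10.0.0.3", "10.0.0.9", 12),
    ("10.0.0.1", "10.0.0.8", 11),
]

out_degree = Counter()
in_degree = Counter()
for src, dst, _num_records in edges:
    out_degree[src] += 1   # distinct destinations this IP talks to
    in_degree[dst] += 1    # distinct sources talking to this IP

# Total degree per node; Counter returns 0 for missing keys
nodes = set(out_degree) | set(in_degree)
degree = {ip: out_degree[ip] + in_degree[ip] for ip in nodes}

# Highest-degree IPs first, ties broken by IP string for a stable order
top_hubs = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top_hubs)
```

Here `10.0.0.9` stands out as the hub that three different sources connect to.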

### DNS Beaconing

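* Find `(src_ip, dst_ip)` pairs whose low-volume communications recur at regular intervals, i.e., the time gaps between flows have low variance
* Challenge: switch the metric from variance to the **entropy of the intervals** between communications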
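
As a CPU-only illustration of the variance idea, the sketch below groups timestamps per `(src_ip, dst_ip)` pair, computes the gaps between consecutive flows within each pair, and ranks pairs by the sample variance of those gaps. The flows and the `interval_stats` helper are hypothetical; the notebook's own implementation uses cudf DataFrames.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, variance

# Hypothetical flows: (src_ip, dst_ip, conn_timestamp)
flows = [
    ("10.0.0.5", "8.8.8.8", "2016-03-28 01:00:00"),
    ("10.0.0.5", "8.8.8.8", "2016-03-28 01:01:00"),
    ("10.0.0.5", "8.8.8.8", "2016-03-28 01:02:00"),
    ("10.0.0.5", "8.8.8.8", "2016-03-28 01:03:00"),
    ("10.0.0.6", "8.8.8.8", "2016-03-28 01:00:00"),
    ("10.0.0.6", "8.8.8.8", "2016-03-28 01:00:05"),
    ("10.0.0.6", "8.8.8.8", "2016-03-28 01:07:00"),
    ("10.0.0.6", "8.8.8.8", "2016-03-28 01:08:30"),
]

# Collect timestamps per pair so gaps never span two different pairs
by_pair = defaultdict(list)
for src, dst, ts in flows:
    by_pair[(src, dst)].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))

def interval_stats(timestamps):
    """Count, mean, and sample variance of the gaps (seconds) between consecutive flows."""
    ordered = sorted(timestamps)
    deltas = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return {"count": len(deltas), "mean": mean(deltas), "var": variance(deltas)}

# variance() needs at least two gaps, i.e. three flows per pair
stats = {pair: interval_stats(ts) for pair, ts in by_pair.items() if len(ts) >= 3}

# Lowest variance first: the most regular (beacon-like) pairs
ranked = sorted(stats.items(), key=lambda kv: kv[1]["var"])
for pair, s in ranked:
    print(pair, s)
```

The first pair fires exactly every 60 seconds, so its variance is zero and it ranks first; the second pair has irregular gaps and a large variance.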

In [46]:
%%time
beaconing_query = """
                  SELECT 
                      src_ip, dst_ip, conn_timestamp
                  FROM logs_10m
                      WHERE src_port = 53 and bytes < 1000
                  ORDER BY src_ip, dst_ip, conn_timestamp ASC
                  """

dns_flows = bc.sql(beaconing_query)

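# Count flows per (src_ip, dst_ip) pair and keep pairs with more than 1000 DNS flows
frequent_srcs = dns_flows[['src_ip', 'dst_ip']].assign(hit=1).groupby(['src_ip', 'dst_ip']).count().reset_index()
frequent_srcs = frequent_srcs[ frequent_srcs['hit'] > 1000 ]
print(len(frequent_srcs))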

108
CPU times: user 3.08 s, sys: 2.24 s, total: 5.32 s
Wall time: 5.3 s


In [None]:
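# Keep flows whose src_ip belongs to a frequent pair ...
dns_heavy_flows = dns_flows.merge(frequent_srcs[['src_ip', 'hit']], how='inner', on='src_ip')
dns_heavy_flows = dns_heavy_flows[ dns_heavy_flows['hit'] > 1000 ].drop(columns=['hit'])
# ... and whose dst_ip belongs to a frequent pair
dns_heavy_flows = dns_heavy_flows.merge(frequent_srcs[['dst_ip', 'hit']], how='inner', on='dst_ip')
dns_heavy_flows = dns_heavy_flows[ dns_heavy_flows['hit'] > 1000 ].drop(columns=['hit'])

# Seconds since the previous row; assumes rows are still ordered by (src_ip, dst_ip, conn_timestamp)
dns_heavy_flows['delta_s'] = dns_heavy_flows['conn_timestamp'].astype('datetime64[s]').astype('int64').diff()
# Drop zero gaps (duplicate rows/timestamps) and negative gaps (backward jumps at pair boundaries)
dns_heavy_flows = dns_heavy_flows[ dns_heavy_flows['delta_s'] > 0 ]

# Duplicate delta_s so one copy can feed 'var' and the other 'mean' in the aggregation
dns_heavy_flows = dns_heavy_flows[['src_ip', 'dst_ip', 'delta_s']]
dns_heavy_flows['delta2_s'] = dns_heavy_flows['delta_s']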

In [None]:
dns_heavy_flows = dns_heavy_flows.assign(hit=1).groupby(['src_ip', 'dst_ip']).agg({
    'delta_s': 'var',
    'delta2_s': 'mean',
    'hit': 'count'
}).rename(columns={'delta_s': 'var', 'delta2_s': 'mean', 'hit': 'count'}).reset_index()

dns_heavy_flows = dns_heavy_flows[ dns_heavy_flows['count'] > 1000 ]

print(len(dns_heavy_flows))

dns_flows=None
frequent_srcs=None
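
For the low-entropy challenge mentioned above, one possible approach is to bucket the inter-arrival times into fixed-width bins and compute the Shannon entropy of the bin distribution. The sketch below is an assumption-laden illustration: `interval_entropy` and its `bin_width_s` parameter are hypothetical names, and the bin width would need tuning for real traffic.

```python
import math
from collections import Counter

def interval_entropy(deltas_s, bin_width_s=10):
    """Shannon entropy (in bits) of inter-arrival times bucketed into fixed-width bins."""
    bins = Counter(int(d // bin_width_s) for d in deltas_s)
    total = sum(bins.values())
    return sum(-(n / total) * math.log2(n / total) for n in bins.values())

# Hypothetical gaps in seconds
regular = [60, 61, 59, 60, 62, 60]      # jittered beacon: almost all gaps share one bin
irregular = [5, 415, 90, 33, 240, 12]   # human-like traffic: every gap in its own bin

print(interval_entropy(regular))
print(interval_entropy(irregular))
```

Unlike variance, entropy stays low for a beacon that alternates between a few fixed intervals, since only a handful of bins are ever used. A regular sender with slight jitter scores near zero, while spread-out gaps approach the maximum of `log2(number of gaps)`.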

In [47]:
graphistry\
    .edges(dns_heavy_flows.to_pandas())\
    .bind(source='src_ip', destination='dst_ip')\
    .plot()
    

In [None]:
dns_heavy_flows.tail()

### High flows

In [None]:
%%time
gdf = bc.sql(query)
gdf = gdf[ gdf['num_records'] > 10 ]

print('# rows x columns', gdf.shape)
print('# unique src_ip', len(gdf['src_ip'].unique()))

# Copy to a pandas (CPU) DataFrame, then drop the GPU reference to release GPU memory
pdf = gdf.to_pandas()
gdf = None

pdf.sample(3)

## Visualize!

In [None]:
g = graphistry\
    .edges(pdf)\
    .bind(source='src_ip', destination='dst_ip')

g.plot()