## Tutorial 2: COVID Registered Domain Analysis

Part of the [Python security RAPIDS GPU & graph one-liners](Tutorial_0_Intro.ipynb)

All GPU Python data science tutorials: [RAPIDS Academy github](https://github.com/RAPIDSAcademy/rapidsacademy)

* Data: CSV of passive DNS records for COVID-related domains from [AlienVault OTX](https://otx.alienvault.com/), part of [The COVID-19 Cyber Threat Coalition](https://www.cyberthreatcoalition.org/)
* Let's find the top domains!

This tutorial walks through the full analysis pipeline:

* Loading a CSV with a GPU-accelerated reader (`cudf`)
* Identifying IPs with GPU-accelerated regex (`cudf`)
* Precomputing graph statistics such as centrality (`cugraph`)
* Filtering to interesting subgraphs using the statistics (`cudf`)
* Visually analyzing the results (`graphistry`)

### Setup

To set up and test your RAPIDS environment (cudf, blazingsql, graphistry, ...), use the [setup guide](setup.ipynb).

### Install data and packages

In [16]:
#! curl -sL -o corona.csv.zip "https://www.dropbox.com/s/es1rl0m4nrbs02x/corona.csv.zip?dl=1" \
#    && ls -alh corona.csv.zip \
#    && unzip  corona.csv.zip \
#    && ls -alh  corona.csv

### Imports

In [19]:
import cudf, cugraph, graphistry, json, pandas as pd
from cupy import arange
pd.options.display.max_rows = 100


### Get free Graphistry Hub account & creds at https://www.graphistry.com/get-started
### First run: set to True and fill in creds
### Future runs: set to False and erase your creds
### When done: delete graphistry.json
if False:
    #creds = {'token': '...'}
    creds = {'username': '***', 'password': '***'}
    with open('graphistry.json', 'w') as outfile:
        json.dump(creds, outfile)
with open('graphistry.json') as f:
    creds = json.load(f)

graphistry.register(
    api=3, key='', protocol='https', server='hub.graphistry.com', 
    **creds)

print('graphistry', graphistry.__version__, 'cudf', cudf.__version__)

graphistry 0.10.4 cudf 0.11.0


### Data

In [20]:
%%time
gdf = cudf.read_csv('./corona.csv')
print('shape', gdf.shape)
gdf.head(3)

shape (339760, 8)
CPU times: user 58.8 ms, sys: 92.6 ms, total: 151 ms
Wall time: 150 ms


Unnamed: 0,query,answer,rr,maptype,ttl,count,first_seen,last_seen
0,covidvolunteersbd.com,178.238.234.176,IN,A,12470,1,2020-03-27 00:01:30,2020-03-27 00:01:30
1,cpcontacts.covidvolunteersbd.com,178.238.234.176,IN,A,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31
2,www.covidvolunteersbd.com,covidvolunteersbd.com,IN,CNAME,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31


## Extract IP/Domain via GPU Regex
* Save as columns **{query,answer}\_entity**
* Choosing between the IP and the domain for each row round-trips through CPU pandas: slow, but a GPU-only version was too tricky

In [21]:
%%time
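# Raw strings keep the backslashes intact; every dot in the IP pattern is escaped
gdf['is_query_ip'] = gdf['query'].str.match(r'\d+\.\d+\.\d+\.\d+')
gdf['is_answer_ip'] = gdf['answer'].str.match(r'\d+\.\d+\.\d+\.\d+')

# Keep the last two dot-separated labels, e.g. www.example.com -> example.com
gdf['query_domain'] = gdf['query'].str.extract(r'([^.]*\.[^.]+)$')
gdf['answer_domain'] = gdf['answer'].str.extract(r'([^.]*\.[^.]+)$')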
gdf.head(3)

CPU times: user 51.7 ms, sys: 17 ms, total: 68.8 ms
Wall time: 67.5 ms


Unnamed: 0,query,answer,rr,maptype,ttl,count,first_seen,last_seen,is_query_ip,is_answer_ip,query_domain,answer_domain
0,covidvolunteersbd.com,178.238.234.176,IN,A,12470,1,2020-03-27 00:01:30,2020-03-27 00:01:30,False,True,covidvolunteersbd.com,234.176
1,cpcontacts.covidvolunteersbd.com,178.238.234.176,IN,A,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31,False,True,covidvolunteersbd.com,234.176
2,www.covidvolunteersbd.com,covidvolunteersbd.com,IN,CNAME,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31,False,False,covidvolunteersbd.com,covidvolunteersbd.com
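
To see what the two patterns do, here is a CPU-only sketch using Python's standard `re` module rather than cudf. The helper names `is_ip` and `last_two_labels` are illustrative and not part of the notebook, the IP pattern is written with every dot escaped, and the sketch assumes `re.match` and `re.search` behave like the cudf `str.match` and `str.extract` calls for these inputs. The domain pattern simply keeps the last two dot-separated labels, so it yields `234.176` for an IP address (which is why the IP flag is tracked separately) and `co.uk` for a name under a multi-label suffix.

```python
import re

# Same patterns as the GPU cell, with every dot in the IP pattern escaped
IP_PATTERN = re.compile(r'\d+\.\d+\.\d+\.\d+')
DOMAIN_PATTERN = re.compile(r'([^.]*\.[^.]+)$')


def is_ip(value):
    """True if the value starts with four dot-separated digit groups."""
    return IP_PATTERN.match(value) is not None


def last_two_labels(value):
    """Return the last two dot-separated labels, or None if there is no dot."""
    found = DOMAIN_PATTERN.search(value)
    return found.group(1) if found else None


for value in ['www.covidvolunteersbd.com', '178.238.234.176', 'www.example.co.uk']:
    print(value, is_ip(value), last_two_labels(value))
```

A production pipeline would likely need a public-suffix list to handle names such as `example.co.uk`; that is outside the scope of this tutorial.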


#### Note: GPU->CPU Pandas->GPU slowdown

In [23]:
%%time
gdf['query_entity'] = cudf.Series(
    gdf[['is_query_ip', 'query_domain', 'query']].to_pandas().apply(
        lambda row: row['query_domain'] if not row['is_query_ip'] else row['query'],
        axis=1))
gdf['answer_entity'] = cudf.Series(
    gdf[['is_answer_ip', 'answer_domain', 'answer']].to_pandas().apply(
        lambda row: row['answer_domain'] if not row['is_answer_ip'] else row['answer'],
        axis=1))
gdf.head(3)

CPU times: user 12 s, sys: 0 ns, total: 12 s
Wall time: 12 s


Unnamed: 0,query,answer,rr,maptype,ttl,count,first_seen,last_seen,is_query_ip,is_answer_ip,query_domain,answer_domain,query_entity,answer_entity
0,covidvolunteersbd.com,178.238.234.176,IN,A,12470,1,2020-03-27 00:01:30,2020-03-27 00:01:30,False,True,covidvolunteersbd.com,234.176,covidvolunteersbd.com,178.238.234.176
1,cpcontacts.covidvolunteersbd.com,178.238.234.176,IN,A,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31,False,True,covidvolunteersbd.com,234.176,covidvolunteersbd.com,178.238.234.176
2,www.covidvolunteersbd.com,covidvolunteersbd.com,IN,CNAME,14399,1,2020-03-27 00:01:31,2020-03-27 00:01:31,False,False,covidvolunteersbd.com,covidvolunteersbd.com,covidvolunteersbd.com,covidvolunteersbd.com


## Precompute graph stats
* Component (weak): ID and size
* Importance: degree, PageRank, k-core (Katz and betweenness centrality are commented out below)
* Community: Louvain

### Generate cuGraph

In [25]:
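# Build a cuGraph graph from an edge dataframe whose endpoints are strings.
# Input: df with columns src_col and dst_col (same dtype), plus their names
# Output: (nodes_gdf with columns 'id' and int32 'idx', cugraph.Graph over the idx values)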
def edges_to_cugraph(df, src_col, dst_col, drop_self_loops=False):
    
    if drop_self_loops:
        df = df.copy(deep=False)
        df = df[ ~(df[src_col] == df[dst_col]) ]
    
    #gdf[['id': 'a, 'idx': int32]]
    nodes_gdf = cudf.DataFrame({
        'id': cudf.concat([ df[src_col], df[dst_col] ], ignore_index=True, sort=False).unique()
    })
    nodes_gdf['idx'] = arange(0, len(nodes_gdf), dtype='int32')
    
    #gdf[[src_col_idx, dst_col_idx]]
    edges_gdf = df[[src_col, dst_col]]\
        .merge(
            nodes_gdf.rename(columns={'idx': 'src_idx'}, copy=False),
            left_on=src_col, right_on='id')\
        .merge(
            nodes_gdf.rename(columns={'idx': 'dst_idx'}, copy=False),
            left_on=dst_col, right_on='id')
    
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges_gdf, source='src_idx', target='dst_idx')
    
    return nodes_gdf, G



(nodes_gdf, G) = edges_to_cugraph(gdf, 'query_entity', 'answer_entity', drop_self_loops=True)

nodes_gdf.head(3)

Unnamed: 0,id,idx
0,0-urencontractcoronaloon.nl,0
1,0.0.0.0,1
2,000webhost.com,2
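
The core of `edges_to_cugraph` is relabeling string node names as contiguous integer indices before handing the edge list to cuGraph. A minimal CPU-only sketch of that relabeling follows, using a plain dictionary instead of cudf merges. `relabel_edges` is a hypothetical helper, and sorting the names is an assumption made only to keep the output deterministic.

```python
def relabel_edges(edges, drop_self_loops=False):
    """Map string endpoints to 0..n-1 and return (index_by_name, int_edges)."""
    if drop_self_loops:
        edges = [(src, dst) for src, dst in edges if src != dst]
    # Unique node names across both endpoint columns
    names = sorted({name for edge in edges for name in edge})
    index_by_name = {name: idx for idx, name in enumerate(names)}
    # Replace each endpoint by its integer index
    int_edges = [(index_by_name[src], index_by_name[dst]) for src, dst in edges]
    return index_by_name, int_edges


edges = [
    ('covidvolunteersbd.com', '178.238.234.176'),
    ('covidvolunteersbd.com', 'covidvolunteersbd.com'),  # self loop, dropped below
]
index_by_name, int_edges = relabel_edges(edges, drop_self_loops=True)
print(index_by_name)
print(int_edges)
```

The dictionary plays the role of `nodes_gdf` (`id` to `idx`), and the list comprehension plays the role of the two merges.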


### Decorate nodes with graph stats

In [26]:
%%time

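# Left-join one computed column from a cuGraph result onto the nodes table.
# nodes_gdf[[node_col, ...]] * g_out[[computed_idx, computed_col]]
# => nodes_gdf[[node_col, ..., new_col]] (new_col defaults to computed_col)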
def with_vertex_calc(nodes_gdf, g_out, node_col='idx', computed_idx='vertex', computed_col='label', new_col=None):

    #print('got cols', nodes_gdf.columns, g_out.columns)
    if new_col is None:
        new_col = computed_col
        
    return nodes_gdf.merge(
        g_out[[computed_idx, computed_col]].rename(columns={
                computed_idx: node_col,
                computed_col: new_col
            }, copy=False),
        how='left',
        on=node_col)

# Add '<col>_size': the number of nodes sharing each value of col.
# gdf[['idx', col, ...]] * str => gdf[['idx', col, '<col>_size': int64, ...]]
def size_by_col(nodes_gdf, col):
    #gdf[[col, <col>_size]]
    group_size = nodes_gdf[['idx', col]].groupby(col).count().reset_index().rename(columns={'idx': f'{col}_size'})
    return nodes_gdf.merge(group_size, how='left', on=col)

def decorate_graph(G, nodes_gdf):
    
    nodes_gdf = nodes_gdf.copy(deep=False)

    print('pagerank')
    nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.pagerank(G), computed_col='pagerank')

    #print('katz')
    #nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.katz_centrality(G, alpha=0.01), computed_col='katz_centrality')

    #print('bc')
    #nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.betweenness_centrality(G), computed_col='betweenness_centrality')

    print('louvain')
    nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.louvain(G)[0], computed_col='partition', new_col='louvain')
    
    print('...with size')
    nodes_gdf = size_by_col(nodes_gdf, 'louvain')

    print('weakcc')
    nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.weakly_connected_components(G),
                                 computed_idx='vertices',
                                 computed_col='labels', new_col='community_weak')

    print('...with size')
    nodes_gdf = size_by_col(nodes_gdf, 'community_weak')

    print('core_number')
    nodes_gdf = with_vertex_calc(nodes_gdf, cugraph.core_number(G), computed_col='core_number')
    
    print('...with size')
    nodes_gdf = size_by_col(nodes_gdf, 'core_number')

    print('degree')
    nodes_gdf = with_vertex_calc(nodes_gdf, G.degree().assign(vertex=nodes_gdf['idx']), computed_col='degree')

    print('degrees')
    degrees = G.degrees()
    
    print('in_degree')
    nodes_gdf = with_vertex_calc(nodes_gdf, degrees, computed_col='in_degree')

    print('out_degree')
    nodes_gdf = with_vertex_calc(nodes_gdf, degrees, computed_col='out_degree')

    return nodes_gdf


nodes_decorated_gdf = decorate_graph(G, nodes_gdf)
nodes_decorated_gdf['is_ip'] = nodes_decorated_gdf['id'].str.match(r'\d+\.\d+\.\d+\.\d+')
nodes_decorated_gdf.head(3)

pagerank
louvain
...with size
weakcc
...with size
core_number
...with size
degree
degrees
in_degree
out_degree
CPU times: user 237 ms, sys: 68.7 ms, total: 305 ms
Wall time: 306 ms


Unnamed: 0,id,idx,pagerank,louvain,louvain_size,community_weak,community_weak_size,core_number,core_number_size,degree,in_degree,out_degree,is_ip
0,covid19models.com,60352,1.2e-05,4351,5227,1,77449,4,6584,2,5,5,False
1,covid19modelselection.com,60353,9e-06,1436,33900,1,77449,1,40161,2,2,2,False
2,covid19modification.com,60354,5e-06,1436,33900,1,77449,2,28721,2,2,2,False
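
As a reminder of what `community_weak` and `community_weak_size` mean: a weakly connected component is a set of nodes linked by edges when direction is ignored, and the size column counts the nodes that share a label. The CPU-only sketch below uses a small union-find structure to compute both. It is not how cuGraph computes components, and the function name and label values (the smallest node index in each component) are illustrative choices.

```python
from collections import Counter


def weak_components(num_nodes, edges):
    """Return (labels, sizes): a component label and component size per node."""
    parent = list(range(num_nodes))

    def find(node):
        # Follow parent links to the root, shortening the path along the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for src, dst in edges:
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            # Edge direction is ignored; the smaller index becomes the root
            parent[max(root_a, root_b)] = min(root_a, root_b)

    labels = [find(node) for node in range(num_nodes)]
    counts = Counter(labels)
    sizes = [counts[label] for label in labels]
    return labels, sizes


labels, sizes = weak_components(5, [(0, 1), (2, 1), (3, 4)])
print(labels)
print(sizes)
```

The `Counter` step corresponds to `size_by_col`: group by label, count, and attach the count back to every node.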


### Trim graph to nodes in decently sized communities

In [28]:
'min/max node degree', nodes_decorated_gdf['degree'].min(), nodes_decorated_gdf['degree'].max()

('min/max node degree', 2, 42632)

In [29]:
def filter_edges_by_nodes(nodes_gdf, edges_gdf, node='id', src='src', dst='dst'):
    hits = nodes_gdf[[node]]
    hits['hit'] = True
    
    return edges_gdf\
        .merge(hits.rename(columns={node: src}), how='inner', on=src)\
        .merge(hits.rename(columns={node: dst}), how='inner', on=dst)
               
nodes_subset_gdf = nodes_decorated_gdf[ 
      (nodes_decorated_gdf['community_weak_size'] > 100) \
    & (nodes_decorated_gdf['louvain_size'] > 100) ]
edges_subset_gdf = filter_edges_by_nodes(nodes_subset_gdf, gdf, src='query_entity', dst='answer_entity')

print('nodes', nodes_decorated_gdf.shape, '=>', nodes_subset_gdf.shape)
print('edges', gdf.shape, '=>', edges_subset_gdf.shape)

nodes_subset_gdf.head(3)

nodes (92195, 13) => (73564, 13)
edges (339760, 14) => (283568, 16)


Unnamed: 0,id,idx,pagerank,louvain,louvain_size,community_weak,community_weak_size,core_number,core_number_size,degree,in_degree,out_degree,is_ip
0,covid19models.com,60352,1.2e-05,4351,5227,1,77449,4,6584,2,5,5,False
1,covid19modelselection.com,60353,9e-06,1436,33900,1,77449,1,40161,2,2,2,False
2,covid19modification.com,60354,5e-06,1436,33900,1,77449,2,28721,2,2,2,False
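
`filter_edges_by_nodes` keeps an edge only when both of its endpoints survive the node filter, which is what the two inner merges enforce. The same rule can be sketched on the CPU with a Python set; `keep_edges_between` is a hypothetical name, and the sample edges are made up for illustration.

```python
def keep_edges_between(kept_nodes, edges):
    """Keep only the edges whose source and destination are both kept nodes."""
    kept = set(kept_nodes)
    return [(src, dst) for src, dst in edges if src in kept and dst in kept]


edges = [('a.com', '1.1.1.1'), ('b.com', '1.1.1.1'), ('b.com', '2.2.2.2')]
filtered = keep_edges_between(['a.com', 'b.com', '1.1.1.1'], edges)
print(filtered)
```

The last edge is dropped because `2.2.2.2` is not among the kept nodes, even though `b.com` is.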


### Let's look!

In [30]:
g = graphistry\
    .nodes(nodes_subset_gdf.to_pandas()).edges(edges_subset_gdf.to_pandas())\
    .bind(node='id', source='query_entity', destination='answer_entity')

In [31]:
g.plot()

Uploading 8189 kB. This may take a while...


## Next steps

Part of the [Python security RAPIDS GPU & graph one-liners](Tutorial_0_Intro.ipynb)

All GPU Python data science tutorials: [RAPIDS Academy github](https://github.com/RAPIDSAcademy/rapidsacademy)