# Performance Comparison

## ASONAM 2020 Tutorial
### Authors
 - Brad Rees, PhD [brees@nvidia.com]
 - Corey Nolet [cnolet@nvidia.com]
 
### Development Notes
 - Developed using: RAPIDS v0.17


In [1]:
# See how many GPUs are available
!nvidia-smi

Wed Dec  2 10:13:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro GV100        Off  | 00000000:04:00.0  On |                  Off |
| 33%   44C    P0    27W / 250W |    226MiB / 32503MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# If you have multiple GPUs, set the ones to use
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [3]:
import cudf
import cugraph

import pandas as pd
import networkx as nx

import numpy as np

import os
import time
import gc

# Introduction

Today, organizations collect vast amounts of network traffic and network metadata. As the volume of data collected and velocity at which it's collected continue to increase, security analysts and forensic investigators require fast triage, processing, modeling, and visualization capabilities. Using the [RAPIDS](https://rapids.ai) suite of open-source software, we demonstrate how to:

1. Triage and perform data exploration,
2. Model network data as a graph,
3. Perform graph analytics on the graph representation of the cyber network data, and
4. Prepare the results in a way that is suitable for visualization.

# Data Import and Formatting

For this tutorial, we will use the [IDS 2018 dataset](https://www.unb.ca/cic/datasets/ids-2018.html) from the [Canadian Institute for Cybersecurity](https://www.unb.ca/cic/). The data is stored as raw PCAP files in AWS, and we'll need flow-level data for this use case. To make things easier, we've already created bidirectional flows using the [CIC FlowMeter tool](https://github.com/ISCX/CICFlowMeter). You can download them and get started immediately using the cells below. If you wish to store the data somewhere other than the default location, change the value of `BASE_DIRECTORY`.

In [4]:
BASE_DIRECTORY = "./data/"
DOWNLOAD_DIRECTORY = BASE_DIRECTORY + "cic_ids2018/"
DOWNLOAD_FILE = "Friday-02-03-2018-biflows.tar.gz"
DIR_AND_FILE = DOWNLOAD_DIRECTORY + DOWNLOAD_FILE
DATA_DIRECTORY = DOWNLOAD_DIRECTORY + DOWNLOAD_FILE.split('.')[0] + "/"

In [5]:
!mkdir -p $DOWNLOAD_DIRECTORY
!if [ ! -f $DIR_AND_FILE ]; then echo ">> Downloading data" && wget -O $DIR_AND_FILE https://rapidsai-data.s3.us-east-2.amazonaws.com/cyber/kdd2019/Friday-02-03-2018-biflows.tar.gz; else echo ">> Data already downloaded"; fi
!if [ ! -d $DATA_DIRECTORY ]; then echo ">> Extracting $DOWNLOAD_FILE to $DATA_DIRECTORY" && tar -xzf $DIR_AND_FILE -C $DOWNLOAD_DIRECTORY; else echo ">> Data already extracted to $DATA_DIRECTORY"; fi

>> Data already downloaded
>> Data already extracted to ./data/cic_ids2018/Friday-02-03-2018-biflows/


If you would prefer to create your own biflow data, [follow the directions at the bottom of this page to download the data](https://www.unb.ca/cic/datasets/ids-2018.html) to your machine. You'll then need to build and use the CIC FlowMeter tool to create biflow data.

----

### Load in `conn` (connection) logs

In [6]:
!du -sh $DATA_DIRECTORY

4.1G	./data/cic_ids2018/Friday-02-03-2018-biflows/


In [7]:
# get a list of files
files = []

for f in sorted(os.listdir(DATA_DIRECTORY)):
    fname = os.path.join(DATA_DIRECTORY, f)     
    files.append(fname)


In [8]:
len(files)

442

In [9]:
def read_pandas(f):
    df = pd.read_csv(f)
    return df

In [10]:
def read_cudf(f):
    df = cudf.read_csv(f)
    return df

In [11]:
# Load all the data with Pandas
def read_all_data_pandas(_files):
    data = []
    for f in _files:
        df = read_pandas(f)
        data.append(df)
  
    _df = pd.concat(data)
    del data
    return _df

In [12]:
# Load data with RAPIDS cuDF
def read_all_data_cudf(_files):
    data = []
    for f in _files:
        df = read_cudf(f)
        data.append(df)

    _df = cudf.concat(data)

    del data
    return _df

In [13]:
%%time
pdf = read_all_data_pandas(files)

CPU times: user 51.4 s, sys: 8.31 s, total: 59.7 s
Wall time: 59.8 s


In [14]:
%%time
gdf = read_all_data_cudf(files)

CPU times: user 9.65 s, sys: 1.77 s, total: 11.4 s
Wall time: 11.5 s


In [15]:
(len(pdf), len(gdf))

(8217202, 8217202)

__That's it.__ The data is loaded and ready for processing.

----

We'll inspect the head of the new cuDF as a sanity check.

In [16]:
gdf.head(5)

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:50:13 PM,112640178,2,1,...,0,0.0,0.0,0.0,0.0,56320089.0,159.806133,56320202.0,56319976.0,No Label
1,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:53:02 PM,112642593,2,1,...,0,0.0,0.0,0.0,0.0,56321296.5,120.91526,56321382.0,56321211.0,No Label
2,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:55:51 PM,112640127,2,1,...,0,0.0,0.0,0.0,0.0,56320063.5,251.022907,56320241.0,56319886.0,No Label
3,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:58:40 PM,112641715,2,1,...,0,0.0,0.0,0.0,0.0,56320857.5,67.175144,56320905.0,56320810.0,No Label
4,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 01:01:29 PM,112642245,2,1,...,0,0.0,0.0,0.0,0.0,56321122.5,3.535534,56321125.0,56321120.0,No Label


----

# Data Exploration

## Dataset Summary

There are 84 features for each flow. You can find descriptions of all of them at https://www.unb.ca/cic/datasets/ids-2018.html, or look at the table below, which lists some of the ones we'll be using.


|Field | Description|
|---|---|
|Src IP | Source IP|
|Src Port | Source Port|
|Dst IP | Destination IP|
|Dst Port | Destination Port|
|Tot Fwd Pkts | Total Forward Packets|
|Tot Bwd Pkts | Total Backward Packets|
|Fwd Header Len | Forward Header Length|
|Bwd Header Len | Backward Header Length|
|Fwd Pkts/s | Forward Packets per second|
|Bwd Pkts/s | Backward Packets per second|

## Data Exploration

### Dataset Size and Data Types

We first get a sense of how large the dataset is, and what some column names and their associated data types are.

In [17]:
print(gdf.shape)

(8217202, 84)


In [18]:
print(gdf.dtypes)

Flow ID       object
Src IP        object
Src Port       int64
Dst IP        object
Dst Port       int64
              ...   
Idle Mean    float64
Idle Std     float64
Idle Max     float64
Idle Min     float64
Label         object
Length: 84, dtype: object


### Summary Statistics on Numeric Fields

Often it's useful to generate summary statistics on numeric fields. This is easy with the `describe()` function. Here, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles for selected fields in the dataset.

In [19]:
%%time
## - using RAPIDS
print(gdf[['Flow Duration','Tot Fwd Pkts','Tot Bwd Pkts', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s', 'Bwd Pkts/s']].describe())

       Flow Duration  Tot Fwd Pkts  Tot Bwd Pkts  Fwd Header Len  \
count   8.217202e+06  8.217202e+06  8.217202e+06    8.217202e+06   
mean    1.601290e+07  5.229578e+00  7.693485e+00    1.013261e+02   
std     3.450459e+07  5.748743e+01  1.530253e+02    1.768820e+03   
min     0.000000e+00  0.000000e+00  1.000000e+00    0.000000e+00   
25%     6.480000e+02  1.000000e+00  2.000000e+00    0.000000e+00   
50%     2.041365e+05  2.000000e+00  2.000000e+00    2.000000e+01   
75%     4.082616e+06  7.000000e+00  8.000000e+00    1.400000e+02   
max     1.200000e+08  4.315800e+04  1.220150e+05    2.275004e+06   

       Bwd Header Len    Fwd Pkts/s    Bwd Pkts/s  
count    8.217202e+06  8.217202e+06  8.217202e+06  
mean     1.597771e+02  1.196901e+04  1.589824e+04  
std      3.028208e+03  1.072087e+05  1.082609e+05  
min      0.000000e+00  0.000000e+00  0.000000e+00  
25%      2.000000e+01  0.000000e+00  1.973443e+00  
50%      4.000000e+01  2.093290e+00  1.242776e+01  
75%      1.840000e+02  

In [20]:
%%time
## - using Pandas
print(pdf[['Flow Duration','Tot Fwd Pkts','Tot Bwd Pkts', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s', 'Bwd Pkts/s']].describe())

       Flow Duration  Tot Fwd Pkts  Tot Bwd Pkts  Fwd Header Len  \
count   8.217202e+06  8.217202e+06  8.217202e+06    8.217202e+06   
mean    1.601290e+07  5.229578e+00  7.693485e+00    1.013261e+02   
std     3.450459e+07  5.748743e+01  1.530253e+02    1.768820e+03   
min     0.000000e+00  0.000000e+00  1.000000e+00    0.000000e+00   
25%     6.480000e+02  1.000000e+00  2.000000e+00    0.000000e+00   
50%     2.041365e+05  2.000000e+00  2.000000e+00    2.000000e+01   
75%     4.082616e+06  7.000000e+00  8.000000e+00    1.400000e+02   
max     1.200000e+08  4.315800e+04  1.220150e+05    2.275004e+06   

       Bwd Header Len    Fwd Pkts/s    Bwd Pkts/s  
count    8.217202e+06  8.217202e+06  8.217202e+06  
mean     1.597771e+02  1.196901e+04  1.589824e+04  
std      3.028208e+03  1.072087e+05  1.082609e+05  
min      0.000000e+00  0.000000e+00  0.000000e+00  
25%      2.000000e+01  0.000000e+00  1.973443e+00  
50%      4.000000e+01  2.093290e+00  1.242776e+01  
75%      1.840000e+02  
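To make the rows of the `describe()` output concrete, here is a minimal sketch that computes the same kinds of statistics on a tiny sample using only the Python standard library. The sample values are hypothetical and chosen only for illustration. The sketch assumes the usual conventions for this kind of summary: a sample standard deviation (n - 1 in the denominator) and linearly interpolated quartiles.

```python
import statistics

# Hypothetical flow durations in microseconds (illustrative values only).
durations = [648, 1200, 204136, 4082616, 120000000]

# Quartiles with linear interpolation between the closest ranks.
q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")

summary = {
    "count": len(durations),
    "mean": statistics.mean(durations),
    "std": statistics.stdev(durations),  # sample standard deviation (n - 1)
    "min": min(durations),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(durations),
}

for name, value in summary.items():
    print(f"{name:>5}  {value:.6e}")
```

The `50%` row is the median, which is why a strongly skewed field such as `Flow Duration` can have a mean far above its median.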

## Sorting and Groupby

### Start with a Single Column

In [21]:
%%time
# count flows per destination port, ordered by port number
print(gdf['Dst Port'].value_counts().sort_index())

0        142358
3             1
5             2
10            3
12            1
          ...  
65531         8
65532        14
65533         8
65534        10
65535         2
Name: Dst Port, Length: 64601, dtype: int32
CPU times: user 24.9 ms, sys: 4.18 ms, total: 29.1 ms
Wall time: 27.6 ms


In [22]:
%%time
# count flows per destination port, ordered by port number
print(pdf['Dst Port'].value_counts().sort_index())

0        142358
3             1
5             2
10            3
12            1
          ...  
65531         8
65532        14
65533         8
65534        10
65535         2
Name: Dst Port, Length: 64601, dtype: int64
CPU times: user 76.4 ms, sys: 23.2 ms, total: 99.7 ms
Wall time: 98.6 ms
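As a rough illustration of what `value_counts().sort_index()` computes, the sketch below counts occurrences per port and then orders the result by port number, using only the standard library. The port list is hypothetical; this shows the idea, not how pandas or cuDF implement it.

```python
from collections import Counter

# Hypothetical destination ports, one entry per flow (illustrative values only).
dst_ports = [443, 80, 443, 53, 80, 443, 0]

counts = Counter(dst_ports)        # port -> number of flows seen on that port
by_port = sorted(counts.items())   # order by port number, like sort_index()

for port, n_flows in by_port:
    print(f"{port:<6} {n_flows}")
```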


### Sort the Entire DataFrame

In [23]:
%%time
gdf.sort_values(by='Dst Port').head(5)

CPU times: user 149 ms, sys: 32.1 ms, total: 181 ms
Wall time: 196 ms


Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:50:13 PM,112640178,2,1,...,0,0.0,0.0,0.0,0.0,56320089.0,159.806133,56320202.0,56319976.0,No Label
1,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:53:02 PM,112642593,2,1,...,0,0.0,0.0,0.0,0.0,56321296.5,120.91526,56321382.0,56321211.0,No Label
2,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:55:51 PM,112640127,2,1,...,0,0.0,0.0,0.0,0.0,56320063.5,251.022907,56320241.0,56319886.0,No Label
3,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:58:40 PM,112641715,2,1,...,0,0.0,0.0,0.0,0.0,56320857.5,67.175144,56320905.0,56320810.0,No Label
4,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 01:01:29 PM,112642245,2,1,...,0,0.0,0.0,0.0,0.0,56321122.5,3.535534,56321125.0,56321120.0,No Label


In [24]:
%%time
pdf.sort_values(by='Dst Port').head(5)

CPU times: user 4.91 s, sys: 1.36 s, total: 6.27 s
Wall time: 6.25 s


Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 12:50:13 PM,112640178,2,1,...,0,0.0,0.0,0.0,0.0,56320089.0,159.806133,56320202.0,56319976.0,No Label
4534,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 07:00:01 PM,112637563,2,1,...,0,0.0,0.0,0.0,0.0,56318781.5,139.300036,56318880.0,56318683.0,No Label
4538,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 07:02:50 PM,112637600,2,1,...,0,0.0,0.0,0.0,0.0,56318800.0,90.509668,56318864.0,56318736.0,No Label
6357,172.31.65.6-34.224.252.61-0-0-0,172.31.65.6,0,34.224.252.61,0,0,02/03/2018 02:03:20 PM,17832591,2,1,...,0,1665838.0,0.0,1665838.0,1665838.0,16166753.0,0.0,16166753.0,16166753.0,No Label
4490,8.0.6.4-8.6.0.1-0-0-0,8.6.0.1,0,8.0.6.4,0,0,02/03/2018 05:17:20 PM,112636939,2,1,...,0,0.0,0.0,0.0,0.0,56318469.5,30.405592,56318491.0,56318448.0,No Label


In [25]:
%%time
gdf.sort_values(by=['Dst IP','Dst Port']).head(5)

CPU times: user 410 ms, sys: 20.5 ms, total: 431 ms
Wall time: 454 ms


Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
18464,172.31.65.10-1.0.150.69-445-9946-6,172.31.65.10,445,1.0.150.69,9946,6,02/03/2018 02:50:56 PM,23,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
16021,172.31.65.10-1.0.150.69-445-9978-6,172.31.65.10,445,1.0.150.69,9978,6,02/03/2018 02:50:57 PM,198,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
11790,172.31.66.11-1.0.150.69-445-13063-6,172.31.66.11,445,1.0.150.69,13063,6,02/03/2018 02:51:37 PM,36,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
12588,172.31.65.72-1.0.150.69-445-31408-6,172.31.65.72,445,1.0.150.69,31408,6,02/03/2018 02:22:48 PM,20,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
13564,172.31.65.122-1.0.150.69-445-36868-6,172.31.65.122,445,1.0.150.69,36868,6,02/03/2018 02:41:36 PM,32,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label


In [26]:
%%time
pdf.sort_values(by=['Dst IP','Dst Port']).head(5)

CPU times: user 4.87 s, sys: 1.56 s, total: 6.43 s
Wall time: 6.41 s


Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
18464,172.31.65.10-1.0.150.69-445-9946-6,172.31.65.10,445,1.0.150.69,9946,6,02/03/2018 02:50:56 PM,23,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
16021,172.31.65.10-1.0.150.69-445-9978-6,172.31.65.10,445,1.0.150.69,9978,6,02/03/2018 02:50:57 PM,198,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
11790,172.31.66.11-1.0.150.69-445-13063-6,172.31.66.11,445,1.0.150.69,13063,6,02/03/2018 02:51:37 PM,36,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
12588,172.31.65.72-1.0.150.69-445-31408-6,172.31.65.72,445,1.0.150.69,31408,6,02/03/2018 02:22:48 PM,20,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
13564,172.31.65.122-1.0.150.69-445-36868-6,172.31.65.122,445,1.0.150.69,36868,6,02/03/2018 02:41:36 PM,32,1,1,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,No Label
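One detail worth keeping in mind when reading these results: `Dst IP` is stored as a string (dtype `object` in the listing earlier), so sorting on it is lexicographic rather than numeric. The small standard-library sketch below, using hypothetical addresses, shows the difference and one way to get a numeric ordering with `ipaddress`.

```python
import ipaddress

# Hypothetical addresses (illustrative values only).
ips = ["9.0.0.1", "10.0.0.2", "172.31.65.10", "1.0.150.69"]

# Plain string sort: compares character by character, so "10..." precedes "9...".
lexicographic = sorted(ips)

# Numeric sort: parse each address first, then compare the parsed values.
numeric = sorted(ips, key=ipaddress.ip_address)

print(lexicographic)
print(numeric)
```

For a timing comparison the ordering rule does not matter, but it does if the sorted output is meant to be read as address ranges.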


### Aggregate Statistics

Another common way to investigate a new dataset is to calculate aggregate statistics on various fields (and groupings of fields) in the data. Below, we apply a `groupby()` then a `sum()` to calculate total forward-direction (outbound) packets for each source IP address.

In [27]:
%%time
print(gdf[['Src IP','Tot Fwd Pkts']].groupby('Src IP').sum())

                Tot Fwd Pkts
Src IP                      
0.0.0.0                  702
1.0.150.69                87
1.0.196.220               10
1.1.165.90               186
1.1.174.152               44
...                      ...
99.158.245.186             2
99.178.244.230            27
99.46.42.37                1
99.47.33.152               2
99.60.19.158              91

[32674 rows x 1 columns]
CPU times: user 26.9 ms, sys: 4.63 ms, total: 31.5 ms
Wall time: 30.2 ms


In [28]:
%%time
print(pdf[['Src IP','Tot Fwd Pkts']].groupby('Src IP').sum())

                Tot Fwd Pkts
Src IP                      
0.0.0.0                  702
1.0.150.69                87
1.0.196.220               10
1.1.165.90               186
1.1.174.152               44
...                      ...
99.158.245.186             2
99.178.244.230            27
99.46.42.37                1
99.47.33.152               2
99.60.19.158              91

[32674 rows x 1 columns]
CPU times: user 765 ms, sys: 124 ms, total: 890 ms
Wall time: 887 ms
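Conceptually, a `groupby('Src IP').sum()` splits the rows by key, sums each group, and combines the results. The following is a minimal pure-Python sketch of that pattern on hypothetical rows; it illustrates the semantics only and says nothing about how pandas or cuDF execute it.

```python
from collections import defaultdict

# Hypothetical (Src IP, Tot Fwd Pkts) rows (illustrative values only).
flows = [
    ("10.0.0.1", 2),
    ("10.0.0.2", 5),
    ("10.0.0.1", 7),
    ("10.0.0.3", 1),
    ("10.0.0.2", 3),
]

# Accumulate forward packets per source IP.
totals = defaultdict(int)
for src_ip, fwd_pkts in flows:
    totals[src_ip] += fwd_pkts

# Print in key order, similar to the grouped output shown above.
for src_ip in sorted(totals):
    print(f"{src_ip:<12} {totals[src_ip]}")
```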


Next, group by two columns:

In [29]:
%%time
print(gdf[['Src IP','Src Port','Tot Fwd Pkts']].groupby(by=['Src IP','Src Port']).sum())

                       Tot Fwd Pkts
Src IP       Src Port              
0.0.0.0      68                 702
1.0.150.69   9946                 2
             9978                10
             11073                8
             12583                6
...                             ...
99.60.19.158 57617                8
             58136                7
             60341                8
             61756                7
             64300                6

[4445260 rows x 1 columns]
CPU times: user 486 ms, sys: 27.6 ms, total: 514 ms
Wall time: 519 ms


In [30]:
%%time
print(pdf[['Src IP','Src Port','Tot Fwd Pkts']].groupby(by=['Src IP','Src Port']).sum())

                       Tot Fwd Pkts
Src IP       Src Port              
0.0.0.0      68                 702
1.0.150.69   9946                 2
             9978                10
             11073                8
             12583                6
...                             ...
99.60.19.158 57617                8
             58136                7
             60341                8
             61756                7
             64300                6

[4445260 rows x 1 columns]
CPU times: user 2.5 s, sys: 432 ms, total: 2.93 s
Wall time: 2.92 s
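The wall times printed by the `%%time` cells so far can be turned into speedup ratios with a few lines of arithmetic. The numbers below are copied from the outputs in this notebook; they are single runs on one machine (the Quadro GV100 shown by `nvidia-smi`), so treat the ratios as indicative rather than as a benchmark. The operation labels are just descriptive names chosen for this sketch.

```python
# Wall-clock times in seconds as (pandas, cuDF), copied from the %%time outputs.
wall_times = {
    "read 442 CSV files":        (59.8, 11.5),
    "value_counts + sort_index": (0.0986, 0.0276),
    "sort by Dst Port":          (6.25, 0.196),
    "sort by Dst IP, Dst Port":  (6.41, 0.454),
    "groupby Src IP":            (0.887, 0.0302),
    "groupby Src IP, Src Port":  (2.92, 0.519),
}

# Speedup = CPU wall time divided by GPU wall time.
speedups = {op: cpu / gpu for op, (cpu, gpu) in wall_times.items()}

for op, ratio in speedups.items():
    print(f"{op:<28} {ratio:5.1f}x")
```

On these runs the ratios range from roughly 3.6x (the single-column value counts) to roughly 32x (sorting the full DataFrame by one column).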


------
Clean up

In [31]:
del pdf
del gdf
gc.collect()

108

In [32]:
pdf = read_all_data_pandas(files)
gdf = read_all_data_cudf(files)

-------------
# Graph Representation of the Network Data

Networks (including cybersecurity networks) are frequently interpreted and represented as graphs. A graph representation affords us many benefits during analysis, including the ability to use the graph structure, edge features, and generated features for anomaly detection. We first demonstrate how to create a [cuGraph](https://github.com/rapidsai/cugraph) graph from an edge list held in a cuDF DataFrame, then we walk through some analysis.

In [1]:
%%time
G = cugraph.from_cudf_edgelist(gdf, source='Src IP', destination='Dst IP', renumber=True, create_using=cugraph.Graph )


### Calculating the Degree

We'll find the number of connections at each node. This is often useful for seeing which nodes have the most connections, as these are typically backbone assets of the network.

In [None]:
%%time
deg = G.degree()

In [None]:
# top 3 most connected vertices 
deg.sort_values('degree', ascending=False).head(3)

__Now with NetworkX__

In [None]:
%%time
Gnx = nx.from_pandas_edgelist(pdf, source='Src IP', target='Dst IP', create_using=nx.Graph() )

In [None]:
%%time
deg_p = Gnx.degree()

In [None]:
# top 3 most connected vertices 
sorted(deg_p, key=lambda x: x[1], reverse=True)[0:3]
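As a small illustration of what "degree" means here, the sketch below builds an undirected adjacency structure from a hypothetical edge list and counts distinct neighbors per node, using only the standard library. It treats the graph as a simple undirected graph, so repeated flows between the same pair of hosts collapse into one edge; it is not claimed to match how cuGraph counts edges internally.

```python
from collections import defaultdict

# Hypothetical (Src IP, Dst IP) pairs, one per flow (illustrative values only).
edges = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.1"),  # same pair as the first flow, reversed
    ("10.0.0.4", "10.0.0.1"),
]

# Undirected adjacency: each endpoint records the other as a neighbor.
neighbors = defaultdict(set)
for src, dst in edges:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

# Degree = number of distinct neighbors.
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}

# Top 3 most connected vertices, mirroring the cells above.
top = sorted(degree.items(), key=lambda item: item[1], reverse=True)[:3]
print(top)
```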

## PageRank

PageRank (PR) is a [fairly well-known algorithm](https://en.wikipedia.org/wiki/PageRank), originally developed to rank web pages in Google search results. Traditionally, the PageRank algorithm outputs a probability distribution which represents the likelihood that a person randomly clicking on links will arrive at any particular page. We can use that same property to rank states of an attack graph.

In [None]:
%%time
# Call cugraph.pagerank to get the pagerank scores
gdf_pr = cugraph.pagerank(G)

In [None]:
# In order to find the most important node, we first find the maximum PR value
print(gdf_pr['pagerank'].max() )

In [None]:
gdf_pr.sort_values('pagerank', ascending=False).head(3)

__Now with NetworkX__

In [None]:
%%time
# Call nx.pagerank to get the PageRank scores
nx_pr = nx.pagerank(Gnx)

In [None]:
%%time
sorted(nx_pr.items(), key=lambda x: x[1], reverse=True)[0:3]
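To make the random-surfer description of PageRank concrete, here is a minimal power-iteration sketch in pure Python. It is an illustration of the technique only, not the cuGraph or NetworkX implementation. The function name, the tiny directed edge list, the damping factor of 0.85, the tolerance, and the iteration cap are all assumptions made for this example.

```python
def pagerank_sketch(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed edge list (illustrative sketch)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes its damped rank equally along its out-links.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share

        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph: three nodes point at "b", and "b" points back at "a".
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
scores = pagerank_sketch(edges)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

The scores form a probability distribution (they sum to 1), and the node with the most incoming rank, `"b"` in this toy graph, comes out on top.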

## Cool Things with cuGraph
__Use a NetworkX object directly__

In [None]:
%%time
nx_pr2 = cugraph.pagerank(Gnx)

In [None]:
%%time
sorted(nx_pr2.items(), key=lambda x: x[1], reverse=True)[0:3]

__Multi-column vertex IDs__

In [None]:
%%time
G2 = cugraph.from_cudf_edgelist(gdf, source=['Src IP','Src Port'], destination=['Dst IP','Dst Port'], renumber=True, create_using=cugraph.Graph )

In [None]:
%%time
gdf_pr2 = cugraph.pagerank(G2)

In [None]:
gdf_pr2.sort_values('pagerank', ascending=False).head(3)
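The idea behind multi-column vertex IDs with `renumber=True` can be pictured as mapping each distinct `(IP, port)` pair to a compact integer ID and building the edge list from those integers. The sketch below shows that mapping in pure Python on hypothetical flows; the helper name `vertex_id` is made up for this example, and this is only a conceptual picture, not cuGraph's renumbering implementation.

```python
# Hypothetical flows as ((Src IP, Src Port), (Dst IP, Dst Port)) pairs.
flows = [
    (("10.0.0.1", 51000), ("10.0.0.2", 443)),
    (("10.0.0.1", 51001), ("10.0.0.2", 443)),
    (("10.0.0.3", 51000), ("10.0.0.2", 80)),
]

vertex_ids = {}  # (IP, port) -> integer ID, assigned in order of first appearance


def vertex_id(key):
    """Return the integer ID for an (IP, port) pair, creating one if needed."""
    return vertex_ids.setdefault(key, len(vertex_ids))


# Edge list expressed in integer vertex IDs.
edge_list = [(vertex_id(src), vertex_id(dst)) for src, dst in flows]

print(edge_list)
print(vertex_ids)
```

Note that the same host appears as several vertices when it uses several ports, so scores on this graph rank endpoints (host and port together) rather than hosts.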

<hr />

## Acknowledgments

We would like to thank the [Canadian Institute for Cybersecurity](https://www.unb.ca/cic/) for the data used in this tutorial. A complete description of the dataset used is [available online](https://registry.opendata.aws/cse-cic-ids2018/). In addition, the paper associated with this dataset is:

> Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018

We would also like to acknowledge the contributions of Eli Fajardo (NVIDIA), Brad Rees, PhD (NVIDIA), and the [RAPIDS](https://rapids.ai) engineering team.