### CLX

CLX ("clicks") provides a collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases.

The goal of CLX is to:

- Allow cyber data scientists and SecOps teams to generate workflows, using cyber-specific GPU-accelerated primitives and methods, that let them interact with code using security language,
- Make available pre-built use cases, ready to use in a Security Operations Center (SOC), that demonstrate CLX and RAPIDS functionality,
- Accelerate log parsing in a flexible, non-regex method, and
- Provide SIEM integration with GPU compute environments via RAPIDS and effectively extend the SIEM environment.

[GitHub](https://github.com/rapidsai/clx) | [Welcome Notebook](../welcome.ipynb#Cyber-Log-Accelerators)

In [1]:
import cudf
import s3fs
from os import path

# download data
if not path.exists("./splunk_faker_raw4"):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get("rapidsai-data/cyber/clx/splunk_faker_raw4", "./splunk_faker_raw4")

# read in alert data
gdf = cudf.read_csv('./splunk_faker_raw4')
gdf.columns = ['raw']

In [2]:
# parse the alert data using CLX built-in parsers
from clx.parsers.splunk_notable_parser import SplunkNotableParser

snp = SplunkNotableParser()
# parse the 'raw' column into one column per alert field
parsed_gdf = snp.parse(gdf, 'raw')

In [3]:
# number of seconds in one day
SECONDS_PER_DAY = 86400

# aggregate alerts by day: floor each epoch timestamp to the start of its day
parsed_gdf['time'] = parsed_gdf['time'].astype(int)
parsed_gdf['day'] = (parsed_gdf['time'] // SECONDS_PER_DAY) * SECONDS_PER_DAY
day_rule_gdf = parsed_gdf[['search_name','day','time']].groupby(['search_name', 'day']).count().reset_index()
day_rule_gdf.columns = ['rule', 'day', 'count']
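
The day-bucketing and per-rule counting in the cell above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the CLX or cuDF implementation; the names `round_to_day`, `count_alerts_per_day`, and the sample alerts are hypothetical and exist only for illustration.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def round_to_day(epoch_time: int) -> int:
    """Floor a Unix timestamp (in seconds) to the start of its UTC day."""
    return (epoch_time // SECONDS_PER_DAY) * SECONDS_PER_DAY


def count_alerts_per_day(alerts):
    """Count alerts per (rule, day) from an iterable of (rule, epoch_time) pairs."""
    return Counter((rule, round_to_day(t)) for rule, t in alerts)


# hypothetical sample data: three alerts for rule_a over two days, one for rule_b
sample_alerts = [
    ("rule_a", 1546300800 + 10),
    ("rule_a", 1546300800 + 86399),
    ("rule_a", 1546387200),
    ("rule_b", 1546300800),
]
counts = count_alerts_per_day(sample_alerts)
```

Integer floor division keeps the arithmetic exact, which is why this sketch avoids converting through floating point.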

In [4]:
# import the rolling z-score function from CLX statistics
from clx.analytics.stats import rzscore

# pivot the alert data so each rule is a column
def pivot_table(gdf, index_col, piv_col, v_col):
    index_list = gdf[index_col].unique()
    piv_gdf = cudf.DataFrame()
    piv_gdf[index_col] = index_list
    for group in gdf[piv_col].unique():
        
        temp_df = gdf[gdf[piv_col] == group]
        temp_df = temp_df[[index_col, v_col]]
        temp_df.columns = [index_col, group]
        piv_gdf = piv_gdf.merge(temp_df, on=[index_col], how='left')
        
    piv_gdf = piv_gdf.set_index(index_col)
    return piv_gdf.sort_index()

alerts_per_day_piv = pivot_table(day_rule_gdf, 'day', 'rule', 'count').fillna(0)
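
To make the shape of the pivot concrete, here is a small pure-Python sketch of the same long-to-wide transformation, with missing (day, rule) combinations filled with 0. It only mirrors the idea of the cell above; `pivot_counts` and the sample records are hypothetical names, and no cuDF behavior is implied.

```python
def pivot_counts(records, fill=0):
    """Pivot a list of (day, rule, count) records into a day-by-rule table.

    Returns (days, rules, table), where table[day][rule] holds the count
    and combinations absent from the input are set to `fill`.
    """
    days = sorted({day for day, _, _ in records})
    rules = sorted({rule for _, rule, _ in records})
    table = {day: {rule: fill for rule in rules} for day in days}
    for day, rule, count in records:
        table[day][rule] = count
    return days, rules, table


# hypothetical long-format input: one row per (day, rule) with a count
sample_records = [
    (1546300800, "rule_a", 2),
    (1546300800, "rule_b", 1),
    (1546387200, "rule_a", 1),
]
days, rules, table = pivot_counts(sample_records)
```

Each rule becomes a column and each day a row, so a per-rule time series can be read straight down a column.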

In [9]:
alerts_per_day_piv

day,Access - Brute Force Access Behavior Detected - Rule,Access - Geographically Improbable Access Detected - Rule,Access - Privileged User Accessing More Than Expected Number of Machines in Period - Rule,Access - Short-lived Account Detected - Rule,Access - Silver Bullet for InfoSec Algo - Machine Learning,Audit - Indexes not receiving data - Rule,Endpoint - Brute Force against Known User - Rule,Endpoint - FireEye NX alert for Incident Review - Rule,Endpoint - Host With Malware Detected (Quarantined or Waived) - Rule,Endpoint - Host With Malware Detected - Rule,...,Splunk - Detection of DNS Tunnels - Rule,Threat - Authenticated communication from a risky source network - system - Rule,Threat - Beta Testing - Machine Learning,Threat - Compromised account - Rule,Threat - File Name Matches - Threat Gen,Threat - FireEye NX alert for Incident Review - Rule,Threat - Host not sending data - Rule,Threat - Magic Artifical Intelligence Algo - Machine Learning,Threat - Network Resolution Matches - Threat Gen,Threat - Source And Destination Matches - Threat Gen
1546300800,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1546387200,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1546473600,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1546560000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1546646400,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1568332800,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1568419200,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1568505600,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1568592000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# create a new cuDF with the rolling z-score values calculated
r_zscores = cudf.DataFrame()
for rule in alerts_per_day_piv.columns:
    x = alerts_per_day_piv[rule]
    r_zscores[rule] = rzscore(x, 7)  # 7 day window

In [8]:
r_zscores

day,Access - Brute Force Access Behavior Detected - Rule,Access - Geographically Improbable Access Detected - Rule,Access - Privileged User Accessing More Than Expected Number of Machines in Period - Rule,Access - Short-lived Account Detected - Rule,Access - Silver Bullet for InfoSec Algo - Machine Learning,Audit - Indexes not receiving data - Rule,Endpoint - Brute Force against Known User - Rule,Endpoint - FireEye NX alert for Incident Review - Rule,Endpoint - Host With Malware Detected (Quarantined or Waived) - Rule,Endpoint - Host With Malware Detected - Rule,...,Splunk - Detection of DNS Tunnels - Rule,Threat - Authenticated communication from a risky source network - system - Rule,Threat - Beta Testing - Machine Learning,Threat - Compromised account - Rule,Threat - File Name Matches - Threat Gen,Threat - FireEye NX alert for Incident Review - Rule,Threat - Host not sending data - Rule,Threat - Magic Artifical Intelligence Algo - Machine Learning,Threat - Network Resolution Matches - Threat Gen,Threat - Source And Destination Matches - Threat Gen
1546300800,,,,,,,,,,,...,,,,,,,,,,
1546387200,,,,,,,,,,,...,,,,,,,,,,
1546473600,,,,,,,,,,,...,,,,,,,,,,
1546560000,,,,,,,,,,,...,,,,,,,,,,
1546646400,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1568332800,,,,,,,,,,,...,,,,,,,,,,
1568419200,,,,,,,,,,,...,,,,,,,,,,
1568505600,,,,,,,,,,,...,,,,,,,,,,
1568592000,,,,,,,,,,,...,,,,,,,,,,
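
For intuition about what a rolling z-score measures, the sketch below computes one in pure Python under stated assumptions: the window includes the current value, the population standard deviation is used, and positions with an incomplete window or zero spread yield NaN. This is one common definition and may differ from what `rzscore` in CLX actually does; `rolling_zscore` is a hypothetical name.

```python
import math
from statistics import mean, pstdev


def rolling_zscore(values, window):
    """Z-score of each value relative to the trailing `window` values.

    Assumptions (not taken from CLX): the window ends at, and includes,
    the current value; the population standard deviation is used; the
    result is NaN when the window is incomplete or has zero spread.
    """
    scores = []
    for i, x in enumerate(values):
        if i + 1 < window:
            scores.append(math.nan)  # not enough history yet
            continue
        win = values[i + 1 - window : i + 1]
        sd = pstdev(win)
        scores.append((x - mean(win)) / sd if sd > 0 else math.nan)
    return scores


# a constant series has zero spread in every window, so no score is defined
flat_scores = rolling_zscore([0] * 10, 7)

# a short increasing series with a 3-value window
rising_scores = rolling_zscore([1, 2, 3], 3)
```

Under this definition, a rule whose daily count does not vary within a window has no defined z-score for that day, which is one possible reason for missing values in output like the table above.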


In [6]:
from blazingsql import BlazingContext

# connect to BlazingSQL w/ BlazingContext API
bc = BlazingContext(pool=False)

BlazingContext ready


In [7]:
import os

# BlazingContext requires full data path
data_path = f'{os.getcwd().split("/intro_notebooks")[0]}/data/karate.csv'

# what's the data's path?
print(f"data_path == '{data_path}'\n")

# create a BlazingSQL table from any file w/ .create_table(table_name, file_path)
bc.create_table('karate', data_path, header=0)

data_path == '/jupyterhub-homes/winston@blazingdb.com/bsql-demos/welcome_notebooks/data/karate.csv'



<pyblazing.apiv2.context.BlazingTable at 0x7f11d4760850>

In [3]:
bc.sql('select * from karate')

,1 0 1.0
0,2 0 1.0
1,3 0 1.0
2,4 0 1.0
3,5 0 1.0
4,6 0 1.0
...,...
150,30 32 1.0
151,30 33 1.0
152,31 32 1.0
153,31 33 1.0
