<div style="float: right; margin: 20px 20px 20px 20px"><img src="images/cheesy.jpg" width="250px"></div>

# Bro IDS to Spark: Cheesy/Easy Way
**NOTE:** This is NOT the correct way to go from Bro IDS to Spark. We're going to be using local data and a local Spark session, which won't scale at all. But if you just want to explore Spark with some small datasets, this is a super **EASY** way to get started.

All you need to install for this notebook/approach is:

    $ pip install brothon pyspark 

For the correct (more complicated) way please see our Bro IDS to Spark notebook:
- https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb

<div style="float: right; margin: 0px 0px 0px 0px"><img src="images/bro.png" width="100px"></div>

You can test whether Spark is installed correctly by starting up the Spark shell.
    
    $ spark-shell

You'll see some warnings, but if you get the startup banner shown in the screenshot below, you have successfully installed Spark.
You can quit the shell by typing ':quit' at the scala> prompt.
<div style="float: right; margin: 20px 20px 20px 20px"><img src="images/spark.png" width="250px"></div>
<div style="margin: 20px 20px 20px 20px"><img align="left" src="images/spark_shell.png" width="400px"></div>
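
If you would rather check from Python than from the Spark shell, a minimal sketch along these lines can help. The function name `check_spark_install` is hypothetical (it is not part of BroThon or PySpark); it only reports whether the `pyspark` package is importable and whether a `spark-shell` executable is on your PATH, and does not start Spark.

```python
import importlib.util
import shutil


def check_spark_install():
    """Report which Spark pieces are visible from this Python environment."""
    return {
        # True if 'import pyspark' would find the package
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # True if a 'spark-shell' executable is on the PATH
        "spark_shell_on_path": shutil.which("spark-shell") is not None,
    }


print(check_spark_install())
```

If either value is False, rerunning the pip install above in the same environment as your notebook kernel is a reasonable first thing to try.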

In [1]:
from pyspark.sql import SparkSession

In [2]:
from brothon import bro_log_reader
import pandas as pd
import numpy as np

In [3]:
# Convert Bro IDS log to Pandas DataFrame
reader = bro_log_reader.BroLogReader('../data/dns.log')
dns_df = pd.DataFrame(reader.readrows())
dns_df.head()

Successfully monitoring ../data/dns.log...


| | AA | RA | RD | TC | TTLs | Z | answers | id.orig_h | id.orig_p | id.resp_h | ... | qclass_name | qtype | qtype_name | query | rcode | rcode_name | rejected | trans_id | ts | uid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | True | False | 36.000000 | 0 | 54.245.228.191 | 192.168.33.10 | 1030 | 4.2.2.3 | ... | C_INTERNET | 1 | A | guyspy.com | 0 | NOERROR | False | 44949 | 2013-09-15 17:44:27.631940 | CZGShC2znK1sV7jdI7 |
| 1 | False | True | True | False | 1000.000000,36.000000 | 0 | guyspy.com,54.245.228.191 | 192.168.33.10 | 1030 | 4.2.2.3 | ... | C_INTERNET | 1 | A | www.guyspy.com | 0 | NOERROR | False | 50071 | 2013-09-15 17:44:27.696869 | CZGShC2znK1sV7jdI7 |
| 2 | False | True | True | False | 60.000000,60.000000,60.000000,60.000000,60.000... | 0 | 54.230.86.87,54.230.86.18,54.230.87.160,54.230... | 192.168.33.10 | 1030 | 4.2.2.3 | ... | C_INTERNET | 1 | A | devrubn8mli40.cloudfront.net | 0 | NOERROR | False | 39062 | 2013-09-15 17:44:28.060639 | CZGShC2znK1sV7jdI7 |
| 3 | False | True | True | False | 60.000000,60.000000,60.000000,60.000000,60.000... | 0 | 54.230.86.87,54.230.86.18,54.230.84.20,54.230.... | 192.168.33.10 | 1030 | 4.2.2.3 | ... | C_INTERNET | 1 | A | d31qbv1cthcecs.cloudfront.net | 0 | NOERROR | False | 7312 | 2013-09-15 17:44:28.141795 | CZGShC2znK1sV7jdI7 |
| 4 | False | True | True | False | 4993.000000,129.000000,129.000000,129.000000 | 0 | cdn.entrust.net.c.footprint.net,192.221.123.25... | 192.168.33.10 | 1030 | 4.2.2.3 | ... | C_INTERNET | 1 | A | crl.entrust.net | 0 | NOERROR | False | 41872 | 2013-09-15 17:44:28.422704 | CZGShC2znK1sV7jdI7 |


<div style="float: right; margin: 0px 0px 0px 0px"><img src="images/cleanup.jpeg" width="150px"></div>
## Spark needs super clean data
Pandas is pretty flexible and lenient about things like a '-' in a numeric field or fields with NaNs in them. For Spark we need to clean these up. Luckily, Pandas makes this easy.

In [4]:
# Replace Bro '-' with NaNs and then remove any NaNs
dns_df.replace('-', np.nan, inplace=True)  # np.nan works on all NumPy versions (np.NaN was removed in NumPy 2.0)
dns_df.dropna(inplace=True)
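
One thing to keep in mind is that a blanket `dropna()` removes every row that has a NaN in any column, which can throw away a lot of a small log. The sketch below uses a made-up three-row DataFrame (`toy_df` and its values are hypothetical, not taken from dns.log) to show how you might count the missing values first and, if it suits your analysis, drop rows based only on the columns you care about.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for a Bro DNS log
toy_df = pd.DataFrame({
    "query": ["example.com", "-", "example.org"],
    "TTLs": ["36.0", "60.0", "-"],
    "proto": ["udp", "udp", "tcp"],
})

# Same first step as the notebook: Bro's '-' placeholder becomes NaN
cleaned = toy_df.replace("-", np.nan)

# Count missing values per column before deciding what to drop
print(cleaned.isna().sum())

# Blanket drop: any row with a NaN in any column is removed
all_clean = cleaned.dropna()

# Targeted drop: only rows missing a 'query' value are removed
subset_clean = cleaned.dropna(subset=["query"])

print(len(all_clean), len(subset_clean))
```

Note that the targeted version leaves NaNs in other columns, so it is only a fit if you select the clean columns before handing the data to Spark.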

In [5]:
# Spin up a local Spark session
spark = SparkSession.builder.appName('my_awesome').getOrCreate()

In [6]:
# Convert to Spark DF
spark_df = spark.createDataFrame(dns_df)

In [7]:
# Some simple Spark operations: row count and column names
num_rows = spark_df.count()
print("Number of Spark DataFrame rows: {:d}".format(num_rows))
columns = spark_df.columns
print("Columns: {:s}".format(','.join(columns)))

Number of Spark DataFrame rows: 51
Columns: AA,RA,RD,TC,TTLs,Z,answers,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,qclass,qclass_name,qtype,qtype_name,query,rcode,rcode_name,rejected,trans_id,ts,uid


In [8]:
# Count the rows for each protocol
spark_df.groupBy('proto').count().show()

+-----+-----+
|proto|count|
+-----+-----+
|  tcp|    3|
|  udp|   48|
+-----+-----+



<div style="float: right; margin: 0px 0px 0px -30px"><img src="images/confused.jpg" width="150px"></div>
### Note: Spark/PySpark does not like column names with a '.' in them
Spark treats a '.' in a column reference as a nested-field accessor, so for fields like 'id.orig_h' we have to put backticks around the name ( \`id.orig_h\` )

In [9]:
# Count the rows for each originator/responder pair (backticks escape the dots)
spark_df.groupBy('`id.orig_h`', '`id.resp_h`').count().show()

+-------------+---------+-----+
|    id.orig_h|id.resp_h|count|
+-------------+---------+-----+
|192.168.33.10|  8.8.8.8|   12|
|192.168.33.10|  4.2.2.3|   39|
+-------------+---------+-----+
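
If typing backticks by hand gets tedious, here is a small sketch of two ways around it. Both helper names (`quote_col` and `undotted`) are hypothetical and not part of PySpark or BroThon; the first wraps a name in backticks (doubling any embedded backtick, which to my understanding is how Spark SQL escapes them), and the second builds dot-free names, with '_' as an arbitrary replacement choice.

```python
def quote_col(name):
    """Wrap a column name in backticks, doubling any backtick already in it."""
    return "`" + name.replace("`", "``") + "`"


def undotted(names):
    """Return new column names with '.' swapped for '_'."""
    return [n.replace(".", "_") for n in names]


cols = ["id.orig_h", "id.resp_h", "proto"]
quoted = [quote_col(c) for c in cols]
renamed = undotted(cols)
print(quoted)
print(renamed)

# With the notebook's spark_df, either form could then be used, e.g.:
#   spark_df.groupBy(*quoted).count().show()
#   renamed_df = spark_df.toDF(*undotted(spark_df.columns))
```

Renaming once up front means the rest of your Spark code can use plain names, at the cost of the columns no longer matching the original Bro field names.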



## Wrap Up
Well, that's it for this notebook. With a few simple pip installs you are ready to try out Spark on your Bro logs. Yes, it will only work on smaller data, but it gets you **'in the saddle'** quickly. You can try some stuff out, get familiar with Spark, and then dive into setting it up the right way:
<div style="float: right; margin: 0px 0px 0px 0px"><img src="https://www.kitware.com/img/small_logo_over.png" width="200px"></div>
- https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb

If you liked this notebook please visit the [BroThon](https://github.com/Kitware/BroThon) project for more notebooks and examples.
