# **HELK: Basic Sysmon ProcessCreate Graph Query**
## Goals:
* Confirm Jupyter can talk to Spark and GraphFrames
* Confirm Spark and GraphFrames can pull data from Elasticsearch (ES)
* Confirm the Spark Standalone Cluster Manager works (2 workers)
* Create a GraphFrame from the Sysmon index
  * Create vertices and edges DataFrames
* Run a basic query using GraphFrames motifs

## Check the Spark Session via the variable `spark`

In [1]:
spark

## Import GraphFrames & SQL Functions

In [2]:
from graphframes import *

In [3]:
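# SparkSession is used by the custom session below; import it explicitly
# in case the kernel has not already done so.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *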

## Set a Custom Spark Session

In [4]:
spark_graph = SparkSession \
    .builder \
    .appName("HELK") \
    .config("es.read.field.as.array.include", "tags") \
    .config("es.nodes","helk-elasticsearch:9200") \
    .config("es.net.http.auth.user","elastic") \
    .config("es.net.http.auth.pass","elasticpassword") \
    .getOrCreate()
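
The cell above hardcodes the Elasticsearch credentials. As a minimal sketch, assuming the notebook environment exposes them through environment variables (the names `HELK_ES_NODES`, `HELK_ES_USER`, and `HELK_ES_PASS` are hypothetical and not part of HELK), the same options could be collected in a dictionary and applied to the builder in a loop:

```python
import os

def es_spark_options(env=None):
    """Build the elasticsearch-hadoop options used by the session above.

    The HELK_ES_* variable names are hypothetical; the fallback values are
    the ones hardcoded in the original cell.
    """
    env = os.environ if env is None else env
    return {
        "es.read.field.as.array.include": "tags",
        "es.nodes": env.get("HELK_ES_NODES", "helk-elasticsearch:9200"),
        "es.net.http.auth.user": env.get("HELK_ES_USER", "elastic"),
        "es.net.http.auth.pass": env.get("HELK_ES_PASS", "elasticpassword"),
    }

options = es_spark_options()

# Usage in the notebook (requires a running Spark kernel):
# builder = SparkSession.builder.appName("HELK")
# for key, value in options.items():
#     builder = builder.config(key, value)
# spark_graph = builder.getOrCreate()
```

Keeping the options in one place also makes it easier to reuse the same settings across notebooks.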

## Read Data from HELK (Elasticsearch Sysmon Index)

In [5]:
df = spark_graph.read.format("org.elasticsearch.spark.sql").load("logs-endpoint-winevent-sysmon-*/doc")

## Print DataFrame Schema

In [6]:
df.printSchema()

root
 |-- @date_creation: timestamp (nullable = true)
 |-- @date_creation_previous: timestamp (nullable = true)
 |-- @timestamp: timestamp (nullable = true)
 |-- @version: string (nullable = true)
 |-- action: string (nullable = true)
 |-- beat: struct (nullable = true)
 |    |-- hostname: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- device_name: string (nullable = true)
 |-- event_id: integer (nullable = true)
 |-- file_company: string (nullable = true)
 |-- file_description: string (nullable = true)
 |-- file_name: string (nullable = true)
 |-- file_product: string (nullable = true)
 |-- file_version: string (nullable = true)
 |-- geoip: struct (nullable = true)
 |    |-- city_name: string (nullable = true)
 |    |-- continent_code: string (nullable = true)
 |    |-- country_code2: string (nullable = true)
 |    |-- country_code3: string (nullable = true)
 |    |-- country_name: string (nullable = true)
 |    |-- d

## Create Vertices & Edges Dataframes

In [7]:
vertices = df.withColumn("id", df.process_guid).select("id","user_name","host_name","process_parent_name","process_name","action")
vertices = vertices.filter(vertices.action == "processcreate")

In [8]:
vertices.show(3,truncate=False)

+------------------------------------+---------+--------------+-------------------+-------------+-------------+
|id                                  |user_name|host_name     |process_parent_name|process_name |action       |
+------------------------------------+---------+--------------+-------------------+-------------+-------------+
|A98268C1-B131-5AEA-0000-00100AAB6D01|wardog   |DESKTOP-WARDOG|svchost.exe        |taskhostw.exe|processcreate|
|A98268C1-B18F-5AEA-0000-001077476F01|SYSTEM   |DESKTOP-WARDOG|svchost.exe        |wermgr.exe   |processcreate|
|A98268C1-B1A0-5AEA-0000-001069F07001|SYSTEM   |DESKTOP-WARDOG|svchost.exe        |TiWorker.exe |processcreate|
+------------------------------------+---------+--------------+-------------------+-------------+-------------+
only showing top 3 rows



In [9]:
edges = df.filter(df.action == "processcreate").selectExpr("process_parent_guid as src","process_guid as dst").withColumn("relationship", lit("spawned"))

In [10]:
edges.show(3,truncate=False)

+------------------------------------+------------------------------------+------------+
|src                                 |dst                                 |relationship|
+------------------------------------+------------------------------------+------------+
|A98268C1-9E15-5AE6-0000-0010A2BD0000|A98268C1-B0A3-5AEA-0000-0010CA9C6B01|spawned     |
|A98268C1-9E15-5AE6-0000-0010A2BD0000|A98268C1-B086-5AEA-0000-0010730F6901|spawned     |
|A98268C1-9E15-5AE6-0000-0010A2BD0000|A98268C1-B0F3-5AEA-0000-001043326D01|spawned     |
+------------------------------------+------------------------------------+------------+
only showing top 3 rows
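
To make the mapping concrete, here is a plain-Python sketch of what the vertices and edges cells compute, without Spark. The three event records and their short GUIDs are made up for illustration; only the field names come from the notebook. Each `processcreate` event becomes one vertex keyed by `process_guid` and one `spawned` edge from `process_parent_guid` to `process_guid`.

```python
# Made-up events for illustration; real Sysmon GUIDs are much longer.
events = [
    {"action": "processcreate", "process_guid": "G2", "process_parent_guid": "G1",
     "process_parent_name": "services.exe", "process_name": "svchost.exe"},
    {"action": "processcreate", "process_guid": "G3", "process_parent_guid": "G2",
     "process_parent_name": "svchost.exe", "process_name": "taskhostw.exe"},
    # Placeholder for any event whose action is not "processcreate".
    {"action": "other", "process_guid": "G3", "process_parent_guid": None,
     "process_parent_name": None, "process_name": "taskhostw.exe"},
]

def build_graph_tables(events):
    """Return (vertices, edges) lists built from process-creation events only."""
    created = [e for e in events if e["action"] == "processcreate"]
    vertices = [
        {"id": e["process_guid"],
         "process_parent_name": e["process_parent_name"],
         "process_name": e["process_name"]}
        for e in created
    ]
    edges = [
        {"src": e["process_parent_guid"], "dst": e["process_guid"],
         "relationship": "spawned"}
        for e in created
    ]
    return vertices, edges

vertices_rows, edges_rows = build_graph_tables(events)
```

In this sample, `G1` appears as an edge source but not as a vertex, because its own creation event is not in the data. The same may happen in a real index for processes that started before logging began.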



## Create a Graph (Vertices & Edges DataFrames)

In [11]:
g = GraphFrame(vertices, edges)

## Look for Process A Spawning Process B AND Process B Spawning Process C

In [12]:
motifs = g.find("(a)-[]->(b);(b)-[]->(c)")
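
Conceptually, the motif `(a)-[]->(b);(b)-[]->(c)` matches every pair of edges where the destination of the first is the source of the second. The sketch below models that in plain Python on made-up data. It is a simplified illustration of the semantics, not how GraphFrames executes the query (GraphFrames works on distributed DataFrames). Requiring `a`, `b`, and `c` to all be known vertices is an assumption of this sketch.

```python
from collections import defaultdict

def two_hop_chains(vertices, edges):
    """Return (a, b, c) vertex tuples where a -> b and b -> c are both edges.

    `vertices` maps a vertex id to its attributes; `edges` is a list of
    (src, dst) pairs. Chains touching an unknown vertex id are skipped.
    """
    # Index edges by source so the second hop is a dictionary lookup.
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    chains = []
    for a, b in edges:
        for c in children.get(b, []):
            if a in vertices and b in vertices and c in vertices:
                chains.append((vertices[a], vertices[b], vertices[c]))
    return chains

# Made-up ids; the process names are only illustrative.
sample_vertices = {
    "G1": "wininit.exe",
    "G2": "services.exe",
    "G3": "svchost.exe",
    "G4": "lsass.exe",
}
sample_edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G4"), ("G0", "G1")]

chains = two_hop_chains(sample_vertices, sample_edges)
```

Here the only complete chain is `wininit.exe` to `services.exe` to `svchost.exe`; the chains starting at `G0` are dropped because `G0` has no vertex record.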

In [13]:
motifs.select("a.process_parent_name","a.process_name","b.process_parent_name","b.process_name","c.process_parent_name","c.process_name").show(10,truncate=False)

+-----------------------+------------+-------------------+-------------------+-------------------+----------------------+
|process_parent_name    |process_name|process_parent_name|process_name       |process_parent_name|process_name          |
+-----------------------+------------+-------------------+-------------------+-------------------+----------------------+
|smss.exe               |wininit.exe |wininit.exe        |services.exe       |services.exe       |svchost.exe           |
|wininit.exe            |services.exe|services.exe       |SearchIndexer.exe  |SearchIndexer.exe  |SearchProtocolHost.exe|
|smss.exe               |wininit.exe |wininit.exe        |services.exe       |services.exe       |svchost.exe           |
|smss.exe               |smss.exe    |smss.exe           |wininit.exe        |wininit.exe        |lsass.exe             |
|ManagementAgentHost.exe|cmd.exe     |cmd.exe            |cmd.exe            |cmd.exe            |findstr.exe           |
|null                   