# Lab 10: Basic Sysmon Graphing

## Goals:
* Learn the basics of graphing with Spark & GraphFrames
* Confirm Jupyter can talk to Spark & GraphFrames
* Confirm Spark & GraphFrames can manipulate data from Elasticsearch
* Create a GraphFrame from the Sysmon index
* Learn to create vertices and edges DataFrames
* Learn the basics of GraphFrames motifs

## Apache Spark, Jupyter & Elasticsearch

## Check the current Spark Session via the variable spark

You control your Spark Application through a driver process called the SparkSession.
* The SparkSession instance is the way Spark executes user-defined manipulations across the cluster
* There is a one-to-one correspondence between a SparkSession and a Spark Application. 
* In Scala and Python, the variable is available as **spark** when you start the console. 
* Let’s go ahead and look at the SparkSession in Python:

Reference: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (Kindle Locations 436-439). O'Reilly Media. Kindle Edition. 

In [1]:
spark

SparkSession.sparkContext returns the underlying SparkContext

## Creating a Spark Session

A SparkSession can be created using a builder pattern.
* The builder automatically reuses an existing SparkContext if one exists and creates a new SparkContext if it does not
* You can have as many SparkSessions as you want in a single Spark application
* The common use case is to keep relational entities separate logically in catalogs per SparkSession

Reference: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkSession.html

Let's create a new **Spark Session** to interact with our Elasticsearch server:

In [2]:
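# The pyspark shell usually imports SparkSession already; the explicit import
# keeps this cell runnable in kernels that do not.
from pyspark.sql import SparkSession

es_sparksession = (SparkSession
                  .builder
                  .appName("HELK")
                  .config("es.read.field.as.array.include", "tags")
                  .config("es.nodes","10.0.1.10:9200")
                  .config("es.net.http.auth.user","elastic")
                  .config("es.net.http.auth.pass","As3gura3lS3rv1d0rAm1g0!")
                  .getOrCreate()
)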

## Read data from the HELK via Spark SQL

### Using the Data Frame API to access Elasticsearch index (Elasticsearch-Sysmon Index)

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
* Elasticsearch becomes a native source for Spark SQL so that data can be indexed and queried from Spark SQL transparently
* Spark SQL works with structured data - in other words, all entries are expected to have the same structure (same number of fields, of the same type and name)
* Using unstructured data (documents with different structures) is not supported and will cause problems.
* Through the **org.elasticsearch.spark.sql** package, esDF methods are available on the SQLContext API

Reference: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

In [3]:
es_reader = (es_sparksession
          .read
          .format("org.elasticsearch.spark.sql")
          .option("inferSchema", "true")
)

In [4]:
%%time
sysmon_df = es_reader.load("logs-endpoint-winevent-sysmon-*")

CPU times: user 2.1 ms, sys: 0 ns, total: 2.1 ms
Wall time: 5.06 s


## Import Graphframes & SQL Functions

In [7]:
from graphframes import *
from pyspark.sql.functions import *

### Graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. 
* It provides high-level APIs in Scala, Java, and Python. 
* It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames.
* This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

### But what does it do?
GraphFrames represent graphs as two DataFrames:
* Vertices (e.g., users)
* Edges (e.g., relationships between users)

In [8]:
%%time
# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

+---+--------+
| id|inDegree|
+---+--------+
|  c|       1|
|  b|       2|
+---+--------+

CPU times: user 185 ms, sys: 12.5 ms, total: 198 ms
Wall time: 11.2 s
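
To see what the two queries above compute, the same toy edge list can be counted by hand in plain Python. This is a sketch for intuition only; it is not how GraphFrames implements `inDegrees`, and it needs no Spark at all.

```python
from collections import Counter

# Same (src, dst, relationship) tuples as the toy Edge DataFrame above.
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# In-degree of a vertex = number of edges that point at it (count by dst).
in_degrees = Counter(dst for _, dst, _ in edges)

# Number of edges whose relationship is "follow".
follow_count = sum(1 for _, _, rel in edges if rel == "follow")

print(dict(in_degrees))  # {'b': 2, 'c': 1}
print(follow_count)      # 2
```

Vertex `a` has no incoming edge, which is why it does not appear in the `inDegrees` table.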


# ProcessCreate & Motifs

## Create Vertices Dataframe

We are going to copy the **process_guid** column into a new column named **id**, because GraphFrames expects the vertices DataFrame to carry its unique vertex ID in a column with that name.

**withColumn**(colName, col)
* Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

Parameters:	
* **colName** – string, name of the new column.
* **col** – a Column expression for the new column.


In [10]:
vertices = (sysmon_df.withColumn("id", sysmon_df.process_guid)
            .select("id","user_name","host_name","process_parent_name","process_name","action")
           )
vertices = vertices.filter(vertices.action == "processcreate")

In [15]:
%%time
vertices.show(5,truncate=False)

+------------------------------------+-------------+---------------------------+-------------------+------------+-------------+
|id                                  |user_name    |host_name                  |process_parent_name|process_name|action       |
+------------------------------------+-------------+---------------------------+-------------------+------------+-------------+
|CB6FAB7D-3834-5B99-0000-001057337A01|local service|WDRD005.thehuntingelk.local|services.exe       |taskhost.exe|processcreate|
|0300CBFA-3849-5B99-0000-001068EF9C00|local service|WDFN004.thehuntingelk.local|services.exe       |taskhost.exe|processcreate|
|51B208EE-38C8-5B99-0000-0010DF499B00|local service|WDHR002.thehuntingelk.local|services.exe       |taskhost.exe|processcreate|
|5CFEADD3-3943-5B99-0000-001022937B01|local service|WDRD001.thehuntingelk.local|services.exe       |taskhost.exe|processcreate|
|7F66EA28-3977-5B99-0000-0010CF65F900|local service|WDHR005.thehuntingelk.local|services.exe       |task

## Create Edges Dataframe

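We also rename **process_parent_guid** to **src** and **process_guid** to **dst**, the two column names GraphFrames expects in an edges DataFrame. Each edge then represents a parent process spawning a child process, which lets us look for that relationship across our whole environment.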

**selectExpr**

You can also combine selecting columns and renaming columns in a single step with **selectExpr**.

In [17]:
edges = (sysmon_df
         .filter(sysmon_df.action == "processcreate")
         .selectExpr("process_parent_guid as src","process_guid as dst")
         .withColumn("relationship", lit("spawned"))
        )

In [18]:
%%time
edges.show(5,truncate=False)

+------------------------------------+------------------------------------+------------+
|src                                 |dst                                 |relationship|
+------------------------------------+------------------------------------+------------+
|8A57C8BC-4E83-5B98-0000-0010B66B0000|8A57C8BC-B7C0-5B99-0000-0010E286C800|spawned     |
|5CFEADD3-4E41-5B98-0000-0010809B0000|5CFEADD3-B7BB-5B99-0000-0010E4142502|spawned     |
|B4F10000-B7D3-5B99-0000-00105A8B0E02|B4F10000-B7D4-5B99-0000-00109DF40E02|spawned     |
|B4F10000-B7A5-5B99-0000-0010FFCB0502|B4F10000-B7D5-5B99-0000-001051470F02|spawned     |
|B4F10000-B7A4-5B99-0000-0010C6920502|B4F10000-B7D3-5B99-0000-00105A8B0E02|spawned     |
+------------------------------------+------------------------------------+------------+
only showing top 5 rows

CPU times: user 0 ns, sys: 2.62 ms, total: 2.62 ms
Wall time: 4.86 s


## Create a Graph (Vertices & Edges DataFrames)

In [19]:
g = GraphFrame(vertices, edges)

## Process A spawning Process B AND Process B Spawning Process C

In [20]:
%%time
motifs = g.find("(a)-[]->(b);(b)-[]->(c)")

CPU times: user 1.64 ms, sys: 0 ns, total: 1.64 ms
Wall time: 1.15 s
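
For intuition, the motif `(a)-[]->(b);(b)-[]->(c)` can be read as "every pair of edges where the destination of the first is the source of the second". The plain-Python sketch below mimics that on a tiny hand-made graph. The IDs `p0` to `p4` and the process names are hypothetical, and dropping edges whose endpoints are missing from the vertex set reflects my understanding that named motif vertices are matched against the vertices DataFrame; treat that as an assumption rather than a statement about GraphFrames internals.

```python
def two_hop_chains(vertices, edges):
    """Return (a, b, c) id triples where a -> b and b -> c are both edges."""
    known = set(vertices)
    # Assumption: only edges whose endpoints are known vertices can match.
    usable = [(s, d) for s, d in edges if s in known and d in known]
    # Index children by parent so each second hop is a dictionary lookup.
    children = {}
    for s, d in usable:
        children.setdefault(s, []).append(d)
    return [(a, b, c) for a, b in usable for c in children.get(b, [])]


# Hypothetical process tree: p0 has no vertex row of its own, e.g. because
# its own processcreate event was never collected.
vertices = {
    "p1": "services.exe",
    "p2": "svchost.exe",
    "p3": "cmd.exe",
    "p4": "whoami.exe",
}
edges = [("p0", "p1"), ("p1", "p2"), ("p2", "p3"), ("p3", "p4")]

chains = two_hop_chains(vertices, edges)
for a, b, c in chains:
    print(vertices[a], "->", vertices[b], "->", vertices[c])
```

The chain that would start at `p0` is absent from the result, which is worth keeping in mind when a parent process seems to be missing from the motif output.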


In [22]:
%%time
(motifs
     .select("a.process_parent_name","a.process_name","b.process_name","c.process_name")
     .show(20,truncate=False)
)

+-------------------+------------+--------------------------+--------------+
|process_parent_name|process_name|process_name              |process_name  |
+-------------------+------------+--------------------------+--------------+
|smss.exe           |wininit.exe |services.exe              |svchost.exe   |
|winlogon.exe       |userinit.exe|explorer.exe              |ie4uinit.exe  |
|services.exe       |msiexec.exe |msiexec.exe               |regtlibv12.exe|
|msiexec.exe        |msiexec.exe |ngen.exe                  |mscorsvw.exe  |
|msiexec.exe        |msiexec.exe |ngen.exe                  |mscorsvw.exe  |
|wininit.exe        |services.exe|mscorsvw.exe              |mscorsvw.exe  |
|wininit.exe        |services.exe|mscorsvw.exe              |mscorsvw.exe  |
|wininit.exe        |services.exe|mscorsvw.exe              |mscorsvw.exe  |
|wininit.exe        |services.exe|mscorsvw.exe              |mscorsvw.exe  |
|wininit.exe        |services.exe|mscorsvw.exe              |mscorsvw.exe  |

In [26]:
%%time
motifs.groupby('a.process_parent_name').count().sort('count').show(10)

+--------------------+-----+
| process_parent_name|count|
+--------------------+-----+
|         explore.exe|    1|
|        regsvr32.exe|    2|
|     syslauncher.exe|    2|
|      powershell.exe|    9|
|WindowsAzureGuest...|   10|
|        WmiPrvSE.exe|   11|
|      WaAppAgent.exe|   26|
|             cmd.exe|   52|
|         svchost.exe|   54|
|        explorer.exe|   58|
+--------------------+-----+
only showing top 10 rows

CPU times: user 4.81 ms, sys: 868 µs, total: 5.68 ms
Wall time: 15.1 s


### HOW CAN YOU GO DEEPER NOW AND GET MORE INFORMATION ABOUT THOSE?? IT IS NOT EXPLORE.EXE BY ITSELF ANYMORE..

syslauncher???