# Lab 3: Exploring JupyterLab & Apache Spark

## Goals:
* Basic Python Exercise
* Set variables and learn basic operations
* Understand differences in data structures
* Code basic if statements and for loops
* Get familiar with the JupyterLab interface
* Ensure JupyterLab Server is communicating with our Spark Cluster
* Ensure JupyterLab Server, Spark Cluster & Elasticsearch are communicating

## Basic Python Exercise

Let's get started with some language basics for Python.
* Set variables and learn basic operations
* Understand differences in data structures
* Code if statements and for loops

### Print statement

In [1]:
print("Hello Helk!")

Hello Helk!


### Basic Operations

In [2]:
2+4

6

In [3]:
5*6

30

### Setting variables

A variable can have a short name (like x and y) or a more descriptive name (age, dog, owner).
Rules for Python variables:
* A variable name must start with a letter or the underscore character
* A variable name cannot start with a number
* A variable name can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _)
* Variable names are case-sensitive (age, Age and AGE are three different variables)

Reference: https://www.w3schools.com/python/python_variables.asp
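
The naming rules above can be checked with the built-in `str.isidentifier()` method. This is a minimal sketch; the candidate names are made up for illustration. Note that `isidentifier()` also accepts Python keywords and non-ASCII letters, so it is slightly more permissive than the simplified rules listed here.

```python
# Hypothetical candidate names, chosen to exercise each rule above.
candidates = ["dog_name", "_age", "2dogs", "dog-name", "Age"]

# str.isidentifier() reports whether a string is a syntactically valid name.
validity = {name: name.isidentifier() for name in candidates}

for name in candidates:
    print(name, "valid" if validity[name] else "invalid")

# Names are case-sensitive: these are two separate variables.
age = 3
Age = 30
print(age == Age)
```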

In [4]:
dog_name = 'Pedro'
age = 3
is_vaccinated = True
birth_year = 2015

In [5]:
is_vaccinated

True

In [6]:
dog_name

'Pedro'

### Data Types

Reference: https://realpython.com/python-data-types/

**Integers**
* In Python 3, there is effectively no limit to how long an integer value can be
* It is constrained by the amount of memory your system has
* Python interprets a sequence of decimal digits without any prefix to be a decimal number
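
To illustrate the points above, here is a small sketch showing that Python integers can grow past 64 bits and that a prefix changes how the digits are read. The specific values are arbitrary examples.

```python
# Integers are not capped at 64 bits; they grow as needed.
big = 2 ** 100
print(big)
print(type(big))

# Without a prefix, digits are read as a decimal number.
decimal_ten = 10

# With a prefix, the same digits are read in another base.
binary_ten = 0b10   # binary, equals 2
octal_ten = 0o10    # octal, equals 8
hex_ten = 0x10      # hexadecimal, equals 16
print(decimal_ten, binary_ten, octal_ten, hex_ten)
```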

In [7]:
type(age)

int

**Strings**
* Strings are sequences of character data. The string type in Python is called str
* String literals may be delimited using either single or double quotes
* All the characters between the opening delimiter and matching closing delimiter are part of the string
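
As a quick illustration of the delimiter rules above, the following sketch compares the two quote styles. The sample strings are invented for this example.

```python
# Single and double quotes produce equal strings of the same type.
single = 'Pedro'
double = "Pedro"
print(single == double)

# Using one quote style lets the other appear inside the string unescaped.
sentence = "Pedro's favorite toy"
quoted = 'He said "sit"'
print(sentence)
print(quoted)

# Strings are sequences, so they have a length and support indexing.
print(len(single), single[0])
```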

In [8]:
type(dog_name)

str

**Boolean**
* Python 3 provides a Boolean data type.
* Objects of Boolean type may have one of two values, True or False

In [9]:
type(is_vaccinated)

bool

### Combining variables and operations

In [10]:
x = 4
y = 10

In [11]:
x-y

-6

In [12]:
x*y

40

In [13]:
y/x

2.5

In [14]:
y**x

10000

In [15]:
x>y

False

In [16]:
x==y

False

In [17]:
x<y

True

### Data structures

References: 
* https://realpython.com/python-lists-tuples/
* https://realpython.com/python-dicts/

#### Lists
* They are a collection of arbitrary objects, somewhat akin to an array in many other programming languages but more flexible.
* Lists are defined in Python by enclosing a comma-separated sequence of objects in square brackets ([])
* The important characteristics of Python lists are as follows:
  * Lists are ordered.
  * Lists can contain any arbitrary objects.
  * List elements can be accessed by index.
  * Lists can be nested to arbitrary depth.
  * Lists are mutable.
  * Lists are dynamic.
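
The characteristics listed above can be seen in a few lines. This is a minimal sketch using a hypothetical `toys` list that is not part of the lab data.

```python
# A hypothetical list used only for this illustration.
toys = ['ball', 'rope', ['bone', 'stick']]   # nested: a list inside a list

# Ordered and indexable: negative indexes count from the end.
first = toys[0]
nested_item = toys[-1][1]

# Mutable: an element can be replaced in place.
toys[1] = 'frisbee'

# Dynamic: the list can grow and shrink.
toys.append(42)          # objects of mixed types are allowed
removed = toys.pop(0)    # removes and returns the first element

print(first, nested_item, removed)
print(toys)
```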

In [18]:
my_dog_list=['Pedro',3,True,2015]

In [19]:
my_dog_list[0]

'Pedro'

In [20]:
my_dog_list[2:4]

[True, 2015]

In [21]:
print("My dog's name is " + str(my_dog_list[0]) + " and he is " + str(my_dog_list[1]) + " years old.")

My dog's name is Pedro and he is 3 years old.


In [22]:
my_dog_list.append("tennis balls")

In [23]:
my_dog_list

['Pedro', 3, True, 2015, 'tennis balls']

#### Tuples
* Tuples are identical to lists in all respects, except for the following properties:
  * Tuples are defined by enclosing the elements in parentheses (()) instead of square brackets ([]).
  * Tuples are immutable.
* Even though tuples are defined using parentheses, you still index and slice tuples using square brackets, just as for strings and lists
* Sometimes you don’t want data to be modified. If the values in the collection are meant to remain constant for the life of the program, using a tuple instead of a list guards against accidental modification
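
The immutability described above can be demonstrated by attempting an assignment. This sketch uses a hypothetical `pair` tuple and catches the resulting `TypeError`.

```python
pair = ('Pedro', 3)   # hypothetical tuple for illustration

# Indexing and slicing use square brackets, just like lists.
print(pair[0], pair[0:1])

# Assigning to an element is rejected because tuples are immutable.
try:
    pair[1] = 4
    modified = True
except TypeError as err:
    modified = False
    print("Cannot modify tuple:", err)
```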

In [24]:
my_dog_tuple=('Pedro',3,True,2015)

In [25]:
my_dog_tuple

('Pedro', 3, True, 2015)

In [26]:
my_dog_tuple[1]

3

#### Dictionaries
* Dictionaries are Python’s implementation of a data structure that is more generally known as an associative array
* A dictionary consists of a collection of key-value pairs. Each key-value pair maps the key to its associated value
* You can define a dictionary by enclosing a comma-separated list of key-value pairs in curly braces ({})
* A colon (:) separates each key from its associated value
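
Here is a short sketch of common dictionary operations that follow from the description above. The `toy_counts` dictionary is hypothetical and separate from the lab's data.

```python
# Hypothetical dictionary for illustration.
toy_counts = {'ball': 2, 'rope': 1}

# Look up a value by key; .get() returns a default instead of raising KeyError.
balls = toy_counts['ball']
bones = toy_counts.get('bone', 0)

# Dictionaries are mutable: add or update key-value pairs by assignment.
toy_counts['bone'] = 5
toy_counts['rope'] = toy_counts['rope'] + 1

# Iterate over key-value pairs in sorted key order.
for toy, count in sorted(toy_counts.items()):
    print(toy, count)
```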

In [27]:
my_dog_dict={'name':'Pedro','age':3,'is_vaccinated':True,'birth_year':2015}

In [28]:
my_dog_dict

{'age': 3, 'birth_year': 2015, 'is_vaccinated': True, 'name': 'Pedro'}

In [29]:
my_dog_dict['age']

3

In [30]:
my_dog_dict.keys()

dict_keys(['age', 'birth_year', 'name', 'is_vaccinated'])

In [31]:
my_dog_dict.values()

dict_values([3, 2015, 'Pedro', True])

### IF statements

In [32]:
print("x = " + str(x))
print("y = " + str(y))

x = 4
y = 10


In [33]:
if x==y:
    print('yes')
else:
    print('no')

no


### FOR Loops

In [34]:
for item in my_dog_list:
    print(item)

Pedro
3
True
2015
tennis balls


In [35]:
for i in range(0,10):
    print(i*10)

0
10
20
30
40
50
60
70
80
90


## Apache Spark, Jupyter & Elasticsearch

## Check the current Spark Session via the variable spark

You control your Spark Application through a driver process called the SparkSession.
* The SparkSession instance is the way Spark executes user-defined manipulations across the cluster
* There is a one-to-one correspondence between a SparkSession and a Spark Application. 
* In Scala and Python, the variable is available as **spark** when you start the console. 
* Let’s go ahead and look at the SparkSession in Python:

Reference: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (Kindle Locations 436-439). O'Reilly Media. Kindle Edition. 

In [36]:
spark

SparkSession.sparkContext returns the underlying SparkContext

In [37]:
spark.sparkContext

## Creating a Spark Session

A SparkSession can be created using a builder pattern.
* The builder automatically reuses an existing SparkContext if one exists and creates a SparkContext if it does not exist
* You can have as many SparkSessions as you want in a single Spark application
* The common use case is to keep relational entities separate logically in catalogs per SparkSession

Reference: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkSession.html

Let's create a new **Spark Session** to interact with our Elasticsearch server:

In [38]:
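# Import SparkSession explicitly so the cell also works outside the PySpark shell
from pyspark.sql import SparkSession

# Build (or reuse) a session configured for the Elasticsearch connector
es_sparksession = (SparkSession
                  .builder
                  .appName("HELK")
                  .config("es.read.field.as.array.include", "tags")
                  .config("es.nodes","10.0.1.10:9200")
                  .config("es.net.http.auth.user","elastic")
                  .config("es.net.http.auth.pass","As3gura3lS3rv1d0rAm1g0!")
                  .getOrCreate()
)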
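
The cell above writes the Elasticsearch credentials directly into the notebook. One possible alternative, sketched here under assumptions, collects the same `es.*` settings in a dictionary and reads the connection details from environment variables. The variable names `HELK_ES_NODES`, `HELK_ES_USER`, and `HELK_ES_PASS` and both helper functions are hypothetical and not part of HELK.

```python
import os

def build_es_options(env=None):
    """Collect es.* settings, taking connection details from the environment."""
    env = os.environ if env is None else env
    return {
        "es.read.field.as.array.include": "tags",
        # Hypothetical variable names; fall back to the values used in this lab.
        "es.nodes": env.get("HELK_ES_NODES", "10.0.1.10:9200"),
        "es.net.http.auth.user": env.get("HELK_ES_USER", "elastic"),
        "es.net.http.auth.pass": env.get("HELK_ES_PASS", ""),
    }

def apply_options(builder, options):
    """Call builder.config(key, value) for each option and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder

# Show which settings would be applied (the password here is a placeholder).
options = build_es_options({"HELK_ES_PASS": "example-only"})
print(sorted(options))
```

With PySpark available, the result could then be used as `apply_options(SparkSession.builder.appName("HELK"), build_es_options()).getOrCreate()`.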

Our new **Spark Session** reuses the existing **Spark Context**

In [39]:
es_sparksession.sparkContext

## Read data from the HELK via Spark SQL

### Using the Data Frame API to access Elasticsearch index (Elasticsearch-Sysmon Index)

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
* Elasticsearch becomes a native source for Spark SQL so that data can be indexed and queried from Spark SQL transparently
* Spark SQL works with structured data - in other words, all entries are expected to have the same structure (same number of fields, of the same type and name)
* Using unstructured data (documents with different structures) is not supported and will cause problems.
* Through the **org.elasticsearch.spark.sql** package, esDF methods are available on the SQLContext API

Reference: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
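
To make the "same structure" requirement concrete, here is a plain-Python sketch that checks whether a handful of documents share the same fields and types. It is a toy illustration only, not how Spark or the Elasticsearch connector infers a schema, and the helper names and sample documents are hypothetical.

```python
def schema_of(doc):
    """Map each field name to the name of its Python type."""
    return {field: type(value).__name__ for field, value in doc.items()}

def same_structure(docs):
    """Return True when every document has the same fields with the same types."""
    schemas = [schema_of(doc) for doc in docs]
    return all(schema == schemas[0] for schema in schemas)

# Hypothetical sample documents, loosely modeled on fields used in this lab.
structured = [
    {"user_name": "system", "process_name": "cmd.exe"},
    {"user_name": "troy.ellis", "process_name": "regsvr32.exe"},
]
# The extra document has a different field with a different type.
mixed = structured + [{"user_name": "system", "process_id": 4}]

print(same_structure(structured))
print(same_structure(mixed))
```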

In [40]:
es_reader = (es_sparksession
          .read
          .format("org.elasticsearch.spark.sql")
          .option("inferSchema", "true")
)

In [41]:
es_sysmon = es_reader.load("logs-endpoint-winevent-sysmon-*/doc")

**Load**: Loads data from a data source and returns it as a `DataFrame`.

Reference: http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.load

## Print DataFrame Schema

In [42]:
es_sysmon.printSchema()

root
 |-- @date_creation: timestamp (nullable = true)
 |-- @date_creation_previous: timestamp (nullable = true)
 |-- @timestamp: timestamp (nullable = true)
 |-- @version: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Consumer: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- FileVersion: string (nullable = true)
 |-- Filter: string (nullable = true)
 |-- LogonId: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- SourceProcessGuid: string (nullable = true)
 |-- TargetProcessGuid: string (nullable = true)
 |-- action: string (nullable = true)
 |-- any_ip_addr: string (nullable = true)
 |-- any_ip_geo: struct (nullable = true)
 |    |-- as_org: string (nullable = true)
 |    |-- asn: integer (nullable = true)
 |-- beat: struct (nullable = true)
 |    |-- hostname: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- device_name: string (nullable = true)
 |-- dst

Filter the data to show only selected fields and events with the action **"processcreate"**, which corresponds to Sysmon Event ID 1

In [43]:
events = es_sysmon.select("user_name","host_name","process_parent_name","process_name","action")
event_id_1 = events.filter(events.action == "processcreate")

In [44]:
event_id_1.show(10,truncate=False)

+-------------+---------------------------+--------------------+----------------+-------------+
|user_name    |host_name                  |process_parent_name |process_name    |action       |
+-------------+---------------------------+--------------------+----------------+-------------+
|local service|WDFN002.thehuntingelk.local|services.exe        |taskhost.exe    |processcreate|
|system       |dc-helk.thehuntingelk.local|svchost.exe         |WmiPrvSE.exe    |processcreate|
|system       |dc-helk.thehuntingelk.local|CollectGuestLogs.exe|cmd.exe         |processcreate|
|system       |WDHR004.thehuntingelk.local|services.exe        |taskhost.exe    |processcreate|
|system       |WDFN003.thehuntingelk.local|services.exe        |taskhost.exe    |processcreate|
|troy.ellis   |WDHR005.thehuntingelk.local|powershell.exe      |regsvr32.exe    |processcreate|
|troy.ellis   |WDHR005.thehuntingelk.local|cmd.exe             |cmd.exe         |processcreate|
|local service|WDRD004.thehuntingelk.loc