<a href="https://colab.research.google.com/github/HansHenseler/masdav2022/blob/main/Part_4_Elasticsearch_and_log2timeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Opensearch and log2timeline

Exercise 4: 

Master of Advanced Studies in Digital Forensics & Cyber Investigation

Data Analytics and Visualization for Digital Forensics

(c) Hans Henseler, 2022


## 1 Installing plaso tools in the colab notebook

First install Plaso-tools as we did in exercise 3

In [None]:
# various install steps to install plaso tools and dependencies to get plaso working in colab
# -y option is to skip user interaction
# some packages need to be deinstalled and reinstalled to resolve dependencies
# these steps take app. 3 minutes to complete on a fresh colab instance
!add-apt-repository -y ppa:gift/stable
!apt update
!apt-get update
!apt install plaso-tools
!pip uninstall -y pytsk3
!pip install pytsk3
!pip uninstall -y yara-python
!pip install yara-python
!pip uninstall -y lz4
!pip install lz4

In [2]:
# This notebook was tested with version 20220724 (psort.py assumes opensearch and longer elasticsearch)

!psort.py -V

plaso - psort version 20220724


In [3]:
# check if plaso tools were installed by running psort.py

!psort.py -o list



******************************** Output Modules ********************************
         Name : Description
--------------------------------------------------------------------------------
      dynamic : Dynamic selection of fields for a separated value output
                format.
         json : Saves the events into a JSON format.
    json_line : Saves the events into a JSON line format.
          kml : Saves events with geography data into a KML format.
       l2tcsv : CSV format used by legacy log2timeline, with 17 fixed fields.
       l2ttln : Extended TLN 7 field | delimited output.
         null : Output module that does not output anything.
   opensearch : Saves the events into an OpenSearch database.
opensearch_ts : Saves the events into an OpenSearch database for use with
                Timesketch.
        rawpy : native (or "raw") Python output.
          tln : TLN 5 field | delimited output.
         xlsx : Excel Spreadsheet (XLSX) output
----------------------------

## 2 Download and setup the Opensearch instance

There are different was to install and use Opensearch. Because we are working in a notebook we will download and install using a tar-ball which can be downloaded from: https://opensearch.org/downloads.html .

In [4]:
!wget -q https://artifacts.opensearch.org/releases/bundle/opensearch/2.2.0/opensearch-2.2.0-linux-x64.tar.gz
!wget -q https://artifacts.opensearch.org/releases/bundle/opensearch/2.2.0/opensearch-2.2.0-linux-x64.tar.gz.sha512
!tar -zxf opensearch-2.2.0-linux-x64.tar.gz
!shasum -a 512 -c opensearch-2.2.0-linux-x64.tar.gz.sha512

opensearch-2.2.0-linux-x64.tar.gz: OK


In [5]:
# change the owner of the Opensearch filetree from root to daemon because Opensearch cannot run as root.

!sudo chown -R daemon:daemon opensearch-2.2.0/

Run Elasticsearch as a daemon process

In [6]:
# start Opensearch from the commandline as user daemon. 

!sudo -H -u daemon sh -c  "opensearch-2.2.0/opensearch-tar-install.sh 1> /content/opensearch-2.2.0/logs/os.log 2> /content/opensearch-2.2.0/logs/os.err &"

In [8]:
# Sleep for few seconds to let the instance start.
import time
time.sleep(20)

Once the instance has been started, grep for elasticsearch in the processes list to confirm the availability.

In [9]:
!ps -ef | grep opensearch

daemon      7868       1 86 21:05 ?        00:00:42 /content/opensearch-2.2.0/jdk/bin/java -Xshare:auto -Dopensearch.networkaddress.cache.ttl=60 -Dopensearch.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/opensearch-6940095575402420919 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Dclk.tck=100 -Djdk.attach.allowAttachSelf=true -Djava.security.policy=/content

Query the base endpoint at localhost port 9200 to retrieve information about the cluster. Opensearch requires https, username and password.

In [10]:
!curl -XGET https://localhost:9200 -u 'admin:admin' --insecure

{
  "name" : "a60adf93bf0b",
  "cluster_name" : "opensearch",
  "cluster_uuid" : "dgyuJXYETfKoTk-6wL_G-w",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.2.0",
    "build_type" : "tar",
    "build_hash" : "b1017fa3b9a1c781d4f34ecee411e0cdf930a515",
    "build_date" : "2022-08-09T02:27:25.256769336Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}


In [None]:
# After installation there is only one index that is the  .opendistro_security index that is used for internal purposes

!curl -XGET "https://localhost:9200/_cat/indices?v" -u 'admin:admin' --insecure  

health status index                uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .opendistro_security a2dJDoODRAWjjqnKuWByyA   1   0         10            0     69.1kb         69.1kb


## 3 Use Log2timeline.py and Psort.py to load data in Elasticsearch

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# In part 3 (step 3) we stored the mus2019ctf.plaso file in your drive. 
#
plaso_file = 'gdrive/MyDrive/mus2019ctf.plaso'
#
# and check if it's there
#
!ls -l $plaso_file

-rw------- 1 root root 403488768 Aug 15 15:17 gdrive/MyDrive/mus2019ctf.plaso


In [None]:
# If it's not there you can create it by repeating the following steps
# 
# The complete mus2019ctf.plaso file is 450MB and takes a while. After you have created it
# it makes sense to store it in your gdrive so you can reuse it:
#
#plaso_file = 'gdrive/MyDrive/Colab\ Notebooks/Data\ Analytics\ and\ Visualisation\ Course/mus2019ctf.plaso'
#
# if not you need to create it with log2timeline.py using the complete windows_filter.txt filter
#
# add a shortcut in your Google drive to this shared drive https://drive.google.com/drive/folders/1KUlZUl4Sy2JzgbuRW-oHjIGFClY2bl75?usp=sharing
# then mount you google drive in this colab (you need to authorize this colab to access your google drive)
#
#disk_image = "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01"
#plaso_gdrive_folder = 'gdrive/MyDrive'
#!wget "https://raw.githubusercontent.com/mark-hallman/plaso_filters/master/filter_windows.txt"
#!log2timeline.py -f filter_windows.txt --storage-file mus2019ctf.plaso $disk_image 
#!ls -l $disk_image
#!cp mus2019ctf.plaso $plaso_gdrive_folder
#plaso_file = 'gdrive/MyDrive/mus2019ctf.plaso'
#!ls -l $plaso_file

-r-------- 1 root root 10616142693 Mar 21  2019 /content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01
-rw------- 1 root root 403488768 Aug 15 15:17 gdrive/MyDrive/mus2019ctf.plaso


Use psort to write events to Elasticsearch that we setup earlier. We can use the elastic output format

In [None]:
# Before we do that, let's take a look at the opensearch.mappings file that comes with plaso
# actually there is more in that folder that you may be interested in
#
!ls /usr/share/plaso

filter_no_winsxs.yaml  opensearch.mappings  sources.config   winevt-rc.db
filter_windows.txt     plaso-data.README    tag_linux.txt
filter_windows.yaml    presets.yaml	    tag_macos.txt
formatters	       signatures.conf	    tag_windows.txt


In [None]:
# let's take a look at the opensearch.mappings
#
!cat /usr/share/plaso/opensearch.mappings

{
  "properties": {
    "application": {
      "type": "text",
      "fields": {
        "keyword": {"type": "keyword"}}
    },
    "data": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "doc_type": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "event_type": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "exit_status": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "facility": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "file_reference": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "file_size": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "flags": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "identifier": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
 

In [None]:
# Field file_size has type text. 
#
#    "file_size": {
#      "type": "text",
#      "fields": {"keyword": {"type": "keyword"}}
#    },
#
# Let's change the mapping for file_size to type long
#
#
#    "file_size": {
#      "type": "long"
#    },

In [None]:
# run psort.py. It takes about 13 minutes to export all rows from the 385MB plaso file to Opensearch

#
!psort.py -o opensearch --server localhost --port 9200 --opensearch-user admin --opensearch-password admin --opensearch_mappings /usr/share/plaso/opensearch.mappings --use_ssl --ca_certificates_file_path /content/opensearch-2.2.0/config/root-ca.pem --index_name newmus2019ctf $plaso_file --status_view none 

2022-08-16 10:13:27,408 [INFO] (MainProcess) PID:13001 <data_location> Determined data location: /usr/share/plaso
Processing completed.


In [None]:
# Let's take a look again at the indices in our Elasticsearch instance
#
#!curl -X GET "https://localhost:9200/_cat/indices?format=json&pretty" -u admin:admin --insecure
!curl -X GET "https://localhost:9200/_cat/indices?v" -u admin:admin --insecure

health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   security-auditlog-2022.08.16 DJB4b0LsTrGQcIazwR28sQ   1   1         40            0    168.8kb        168.8kb
yellow open   newmus2019ctf                PYOaJzhMSW616S5gCN6nGg   1   1     805010            0    377.5mb        377.5mb
green  open   .opendistro_security         a2dJDoODRAWjjqnKuWByyA   1   0         10            0     69.1kb         69.1kb


In [None]:
# we can also see what fields were mapped in this index by Psort.py
#
!curl -XGET "https://localhost:9200/newmus2019ctf/_mapping?format=json&pretty" -u admin:admin --insecure


{
  "newmus2019ctf" : {
    "mappings" : {
      "properties" : {
        "access_count" : {
          "type" : "long"
        },
        "account_rid" : {
          "type" : "long"
        },
        "application" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        },
        "binary_path" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "birth_droid_file_identifier" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "birth_droid_volume_identifier" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }


## 4 Accessing Elasticsearch via the REST API

In [None]:
# we can use the elastic search api to get the first 10 results
#
!curl -sX GET "https://localhost:9200/_search?format=json&pretty" -u admin:admin --insecure

{
  "took" : 183,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newmus2019ctf",
        "_id" : "nEckpoIBNwEKF-Xi8i4u",
        "_score" : 1.0,
        "_source" : {
          "data_type" : "pe",
          "parser" : "pe",
          "imphash" : "ee7cb4a1dbc04570233ef522c5ab5b76",
          "pe_type" : "Dynamic Link Library (DLL)",
          "section_names" : [
            ".text\u0000\u0000\u0000",
            ".rdata\u0000\u0000",
            ".data\u0000\u0000\u0000",
            ".rsrc\u0000\u0000\u0000",
            ".reloc\u0000\u0000"
          ],
          "path_spec" : "{\"__type__\": \"PathSpec\", \"mft_entry\": 99903, \"mft_attribute\": 2, \"location\": \"\\\\Users\\\\Administrator\\\\Desktop\\\\FTK_Imager_Lite_3.1.1\\\\adfs_globals.dll\", \"parent\": {\"__

In [None]:
# show the settings for the newmus2019ctf index

!curl -sX GET "https://localhost:9200/newmus2019ctf/_settings?format=json&pretty" -u admin:admin --insecure

{
  "newmus2019ctf" : {
    "settings" : {
      "index" : {
        "creation_date" : "1660644810956",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "PYOaJzhMSW616S5gCN6nGg",
        "version" : {
          "created" : "136237827"
        },
        "provided_name" : "newmus2019ctf"
      }
    }
  }
}


## 5 Accessing the Opensearch API in Python

In [None]:
# So far we have been accessing information directly with curl from the Opensearch REST API
# The is also an Opensearch Python API that we can use. See https://opensearch.org/docs/latest/clients/python/
#

from opensearchpy import OpenSearch

host = 'localhost'
port = 9200
auth = ('admin', 'admin') # For testing only. Don't store credentials in code.
ca_certs_path = '/content/opensearch-2.2.0/config/root-ca.pem' # Provide a CA bundle if you use intermediate CAs with your root CA.

# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts = [{'host': 'localhost', 'port': 9200}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
    ca_certs = ca_certs_path
)


Which indexes are available?

In [None]:
client.indices.get_alias("*")

{'security-auditlog-2022.08.16': {'aliases': {}},
 'newmus2019ctf': {'aliases': {}},
 '.opendistro_security': {'aliases': {}}}

Search the index with a full-text query and get the first 5 results

In [None]:

response = client.search(index="newmus2019ctf", body={"query": {"match": {"message": { "query": "selmabouvier"  }}}}, size=5)
docs = response['hits']['hits']
docs

[{'_index': 'newmus2019ctf',
  '_id': 'jUwppoIBNwEKF-XieZor',
  '_score': 7.8198056,
  '_source': {'data_type': 'fs:stat',
   'parser': 'filestat',
   'display_name': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop.ini',
   'file_entry_type': 'file',
   'file_size': 282,
   'file_system_type': 'NTFS',
   'filename': '\\Users\\SelmaBouvier\\Desktop\\desktop.ini',
   'inode': 1125899906949326,
   'is_allocated': True,
   'path_spec': '{"__type__": "PathSpec", "mft_entry": 106702, "mft_attribute": 1, "location": "\\\\Users\\\\SelmaBouvier\\\\Desktop\\\\desktop.ini", "parent": {"__type__": "PathSpec", "part_index": 2, "start_offset": 576716800, "location": "/p1", "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}',
   'sha256_hash': '4b9d687ac625690fd026ed4b236dad1cac90ef69e7ad256cc

In [None]:
# The '_source' property has the psort values
#
for num, doc in enumerate(docs):
  print(num, '-->', doc['_source'], "\n")

0 --> {'data_type': 'fs:stat', 'parser': 'filestat', 'display_name': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop.ini', 'file_entry_type': 'file', 'file_size': 282, 'file_system_type': 'NTFS', 'filename': '\\Users\\SelmaBouvier\\Desktop\\desktop.ini', 'inode': 1125899906949326, 'is_allocated': True, 'path_spec': '{"__type__": "PathSpec", "mft_entry": 106702, "mft_attribute": 1, "location": "\\\\Users\\\\SelmaBouvier\\\\Desktop\\\\desktop.ini", "parent": {"__type__": "PathSpec", "part_index": 2, "start_offset": 576716800, "location": "/p1", "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}', 'sha256_hash': '4b9d687ac625690fd026ed4b236dad1cac90ef69e7ad256cc42766a065b50026', 'datetime': '2019-02-25T20:47:01.374840+00:00', 'message': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop

In [None]:
# We define a Python function to list results
#
def print_results(response):
  for num, doc in enumerate(response['hits']['hits']):
    print(num, '-->', doc['_source']) 

def print_results_detailed(response):
  for num, doc in enumerate(response['hits']['hits']):
    print('\n---------------------------------------------------------------------------------------------------\nresult numer: ',num) 
    for key, val in doc['_source'].items():
      print(key, val)

In [None]:
# we can try this function on the response we got earlier
#
print_results_detailed(response)


---------------------------------------------------------------------------------------------------
result numer:  0
data_type windows:lnk:link
parser lnk
drive_serial_number 2935122090
drive_type 3
file_attribute_flags 0
file_size 0
link_target <My Computer> D:\OneDrive_1_3-18-2019.zip
local_path D:\OneDrive_1_3-18-2019.zip
volume_label SuperDuperSecretStuff
path_spec {"__type__": "PathSpec", "mft_entry": 99166, "mft_attribute": 2, "location": "\\Users\\SelmaBouvier\\AppData\\Roaming\\Microsoft\\Windows\\Recent\\OneDrive_1_3-18-2019.lnk", "parent": {"__type__": "PathSpec", "part_index": 2, "start_offset": 576716800, "location": "/p1", "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}
sha256_hash e0151b1d37b1ec68fa8192bc7e935d393e2af707900506114424d0be1ed55c6a
datetime 1970-

In [None]:
# Opensearch query syntax is quite elaborate. We will provide some examples in this colab
# notice that results can have different fields. It depends on the data_type
# For a complete overview see the Opensearch reference documents
#
# https://opensearch.org/docs/latest/opensearch/query-dsl/index/
#
#query = '{"query": { "query_string": {"query": "source_short: WEBHIST"  }}}'
query = '{"query": { "query_string": {"query": "data_type: windows*link"  }}}'
#query = '{"query": { "query_string": {"query": "drive_type: 3"  }}}'
#query = '{"query": { "query_string": {"query": "drive_type: 3 AND data_type: windows*link" }}}'
#query = '{"query": { "range": {"drive_type": { "gte":0 , "lte":2 } }}}'
#query = '{"query": { "range": {"drive_type": { "gte":1 , "lte":3 } }}}'
#query = '{"query": { "query_string": {"query": "drive_type:>=0 and drive_type:<2"  }}}'
#query = '{"query": { "query_string": {"query": "file_size:>100000 "  }}}'
#query = '{"query": { "query_string": {"query": "file_size:<29696"  }}}'
#query = '{"query": { "range": {"file_size": { "gte":10000, "lte":100000 }  }}}'
#query = '{"query": { "query_string": {"query": "data_type: msie\\\\:*"  }}}'


response = client.search(index="newmus2019ctf", body=query, size=15)

print_results_detailed(response)



---------------------------------------------------------------------------------------------------
result numer:  0
data_type windows:lnk:link
parser lnk
drive_serial_number 2935122090
drive_type 3
file_attribute_flags 0
file_size 0
link_target <My Computer> D:\OneDrive_1_3-18-2019.zip
local_path D:\OneDrive_1_3-18-2019.zip
volume_label SuperDuperSecretStuff
path_spec {"__type__": "PathSpec", "mft_entry": 99166, "mft_attribute": 2, "location": "\\Users\\SelmaBouvier\\AppData\\Roaming\\Microsoft\\Windows\\Recent\\OneDrive_1_3-18-2019.lnk", "parent": {"__type__": "PathSpec", "part_index": 2, "start_offset": 576716800, "location": "/p1", "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}
sha256_hash e0151b1d37b1ec68fa8192bc7e935d393e2af707900506114424d0be1ed55c6a
datetime 1970-

## 6 Elasticsearch field aggregation

In [None]:
# First we define some helper functions:

def print_facets(agg_dict):
  sum=0
  for field, val in agg_dict:
      print("facets of field ", field,':')
      for bucket in val['buckets']:
        for key in bucket:
          if key=='key':
            print('\t',bucket[key],end='=')
          else:
            print(bucket[key],end='')
            sum = sum + bucket[key]
        print()
      print("total number of hits for ",field," is ",sum)

def print_hit_stats(response):
  print('hit stats:')
  for key, val in response['hits'].items():
      print(key, val)
  print('\n')


In [None]:
querystring = '{ "query_string": {"query": "source_short: WEBHIST"  }}'

query = '{"query": %s}' % querystring

print(query)


{"query": { "query_string": {"query": "source_short: WEBHIST"  }}}


In [None]:
# Aggregating results is one of the most powerful options in Elasticsearch
#
# https://opensearch.org/docs/latest/opensearch/aggregations/
#
querystring = '{ "query_string": {"query": "SelmaBouvier"  }}'
facets = '"aggs": { "data_type": { "terms": { "field": "data_type.keyword"}}}'
query = '{"query": %s,%s}' % (querystring,facets)

print(query)


{"query": { "query_string": {"query": "SelmaBouvier"  }},"aggs": { "data_type": { "terms": { "field": "data_type.keyword"}}}}


In [None]:
response = client.search(index="newmus2019ctf", body=query, size=0)
print_hit_stats(response)

print_facets(response['aggregations'].items())
# print_results(response)

hit stats:
total {'value': 3820, 'relation': 'eq'}
max_score None
hits []


facets of field  data_type :
	 windows:evtx:record=2698
	 msie:webcache:container=522
	 fs:stat=220
	 windows:prefetch:execution=172
	 windows:lnk:link=53
	 msie:webcache:containers=31
	 windows:registry:key_value=28
	 windows:registry:appcompatcache=27
	 windows:distributed_link_tracking:creation=25
	 windows:shell_item:file_entry=21
total number of hits for  data_type  is  3797


In [None]:
# Aggregate accross multiple facets
#
querystring = '{ "query_string": {"query": "SelmaBouvier"  }}'

facets = '"aggs": { "parser": { "terms": { "field": "parser.keyword"}},  "data_type": { "terms": { "field": "data_type.keyword"}}}'
query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=0)

print_facets(response['aggregations'].items())


facets of field  parser :
	 winreg/winreg_default=181109
	 winevtx=58506
	 usnjrnl=1005
	 winreg/windows_services=241
	 pe=86
	 olecf/olecf_automatic_destinations/lnk=52
	 lnk=19
	 winreg/windows_task_cache=16
	 olecf/olecf_automatic_destinations=13
	 winreg/windows_sam_users=7
total number of hits for  parser  is  241054
facets of field  data_type :
	 windows:registry:key_value=181109
	 windows:evtx:record=58506
	 fs:ntfs:usn_change=1005
	 windows:registry:service=241
	 pe=86
	 windows:lnk:link=71
	 task_scheduler:task_cache:entry=16
	 olecf:dest_list:entry=13
	 windows:registry:sam_users=7
total number of hits for  data_type  is  482108


In [None]:
# aggregate accross datetime
# 
# also see pipeline aggregations https://opensearch.org/docs/latest/opensearch/pipeline-agg/

querystring = '{ "query_string": {"query": "\\\<\\\/Event\\\>"  }}'

facets = '"aggs": { "datetime": { "date_histogram": { "field": "datetime", "calendar_interval": "year"}}}'

query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=0)

response

{'took': 106,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': None,
  'hits': []},
 'aggregations': {'datetime': {'buckets': [{'key_as_string': '1970-01-01T00:00:00.000Z',
     'key': 0,
     'doc_count': 2},
    {'key_as_string': '1971-01-01T00:00:00.000Z',
     'key': 31536000000,
     'doc_count': 0},
    {'key_as_string': '1972-01-01T00:00:00.000Z',
     'key': 63072000000,
     'doc_count': 0},
    {'key_as_string': '1973-01-01T00:00:00.000Z',
     'key': 94694400000,
     'doc_count': 0},
    {'key_as_string': '1974-01-01T00:00:00.000Z',
     'key': 126230400000,
     'doc_count': 0},
    {'key_as_string': '1975-01-01T00:00:00.000Z',
     'key': 157766400000,
     'doc_count': 0},
    {'key_as_string': '1976-01-01T00:00:00.000Z',
     'key': 189302400000,
     'doc_count': 0},
    {'key_as_string': '1977-01-01T00:00:00.000Z',
     'key': 220924800000,
     'doc_count'

In [None]:
# aggregate accross file_size (must have changed type text to long in mappings file before psort.py)
# 
# also see pipeline aggregations https://opensearch.org/docs/latest/opensearch/pipeline-agg/

querystring = '{ "query_string": {"query": "file_size:<100000000"  }}'

facets = '"aggs": { "file_size": { "histogram": { "field": "file_size", "interval": 10000000}}}'

query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=0)

print_facets(response['aggregations'].items())

facets of field  file_size :
	 0.0=3827
	 10000000.0=20
	 20000000.0=28
	 30000000.0=4
	 40000000.0=11
	 50000000.0=0
	 60000000.0=0
	 70000000.0=4
total number of hits for  file_size  is  3894


In [None]:
# date range search

query = '{"query": { "query_string": {"query": "datetime:[2019-03-12 TO 2019-03-22]"  }}}'

print(query)
response = client.search(index="newmus2019ctf", body=query, size=10)
print_results_detailed(response)


{"query": { "query_string": {"query": "datetime:[2019-03-12 TO 2019-03-22]"  }}}

---------------------------------------------------------------------------------------------------
result numer:  0
data_type windows:evtx:record
parser winevtx
computer_name DESKTOP-0QT8017
event_identifier 100
event_level 4
message_identifier 100
offset 0
provider_identifier {6bba3851-2c7e-4dea-8f54-31e5afd029e3}
record_number 4679
recovered False
source_name Microsoft-Windows-Diagnosis-DPS
strings ['{180B3A99-8C39-4F12-B631-2031998EFE45}', '{5AE2C742-1D4A-4568-A41A-73B87D7A808B}', '{00000000-0000-0000-0000-000000000000}', '%windir%\\system32\\radardt.dll', '{45DE1EA9-10BC-4F96-9B21-4B6B83DBF476}']
user_sid S-1-5-19
xml_string <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Diagnosis-DPS" Guid="{6BBA3851-2C7E-4DEA-8F54-31E5AFD029E3}"/>
    <EventID>100</EventID>
    <Version>0</Version>
    <Level>4</Level>
    <Task>1</Task>
    <O

## 7 Putting Elasticsearch json out into a Pandas dataframe 

In [None]:
# the out is json format which we can store in a pandas dataframe
import pandas as pd
import json
from io import StringIO

output = !curl -sX GET "https://localhost:9200/_search?q=logon" -u admin:admin --insecure
df = pd.read_json(StringIO(output[0]))


In [None]:
df.head

<bound method NDFrame.head of             took  timed_out  _shards  \
total        133      False      3.0   
successful   133      False      3.0   
skipped      133      False      0.0   
failed       133      False      0.0   
max_score    133      False      NaN   
hits         133      False      NaN   

                                                         hits  
total                        {'value': 431, 'relation': 'eq'}  
successful                                                NaN  
skipped                                                   NaN  
failed                                                    NaN  
max_score                                           10.792242  
hits        [{'_index': 'newmus2019ctf', '_id': 'BEsnpoIBN...  >

In [None]:
df['hits']['hits'][:1]

[{'_index': 'newmus2019ctf',
  '_id': 'BEsnpoIBNwEKF-Xi-gHh',
  '_score': 10.792242,
  '_source': {'data_type': 'fs:stat',
   'parser': 'filestat',
   'display_name': 'NTFS:\\Windows\\System32\\Tasks\\Microsoft\\Windows\\Offline Files\\Logon Synchronization',
   'file_entry_type': 'file',
   'file_size': 2840,
   'file_system_type': 'NTFS',
   'filename': '\\Windows\\System32\\Tasks\\Microsoft\\Windows\\Offline Files\\Logon Synchronization',
   'inode': 281474976794476,
   'is_allocated': True,
   'path_spec': '{"__type__": "PathSpec", "mft_entry": 83820, "mft_attribute": 2, "location": "\\\\Windows\\\\System32\\\\Tasks\\\\Microsoft\\\\Windows\\\\Offline Files\\\\Logon Synchronization", "parent": {"__type__": "PathSpec", "part_index": 2, "start_offset": 576716800, "location": "/p1", "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "typ

# Exercises

## 1 Use elasticsearch to filter events in between 2019-03-12 and 2019-03-22

In [None]:
# Your answer

## 2 Write a query that performs an aggregation on source_long and source_short (can you find the right field names?)

In [None]:
# Your answer

## 3 Combine your date range filter from exercise 1 with facet aggregation in exercise 2

In [None]:
# Your answer

## 4 ***Advanced*** Use opensearch facet aggregation to create a treemap visualisation of a filtered set of events in the index.

####Step 1
The source_short and source_long look interesting for visualisation let's focus on REG, LOG and FILE and run a query. Try the following to see what this looks like:

In [None]:
facets = '"aggs": {   "source_long": { "terms": { "field": "source_long.keyword"}}, "source_short": { "terms": { "field": "source_short.keyword"}}}'
daterange = 'datetime:[2019-03-12 TO 2019-03-22]'
# FILE:
querystring = '{ "query_string": {"query": "%s AND  source_short:(FILE OR LOG OR REG)"  }}' % daterange
query = '{"query": %s,%s}' % (querystring,facets)
response = client.search(index="newmus2019ctf", body=query, size=0,track_total_hits=True)
print_facets(response['aggregations'].items())

Why is this not very helpful for a treemap visualisation?

In [None]:
# Your answer

####Step 2
One approach is to aggregate only accross source_long an run 3 separate queries for source_short equal to REG, LOG and FILE respectively and then combine the results into a single dataframe that we can visualise:

In [None]:
# Your answer. If you are struggling you can skip this step and move on to step 3.

####Step 3
Opensearch can also aggregate accros multiple fields combined. This is called multi_terms aggregation which aggregates accross all (source_long, source_short) value pairs. Note: by default opensearch returns 10 buckets max. We set it to 20 here using the size parameter.

See https://opensearch.org/docs/2.0/opensearch/bucket-agg/ for more information about opensearch and multi_term aggregation.

In [None]:
# Your answer

####Step 4
The exercise asked for aggregation accross 3 fields. So let's add the parser field and deepen our treemap visualisation

In [None]:
# Your answer