<a href="https://colab.research.google.com/github/HansHenseler/masdav2024/blob/main/Part_4_Opensearch_and_log2timeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Opensearch and log2timeline

Exercise 4:

Master of Advanced Studies in Digital Forensics & Cyber Investigation

Data Analytics and Visualization for Digital Forensics

(c) Hans Henseler, 2024


## 1 Installing plaso tools in the colab notebook

First, install plaso-tools as we did in exercise 3.

In [None]:
# various install steps to install plaso tools and dependencies to get plaso working in colab
# -y option is to skip user interaction
# some packages need to be uninstalled and reinstalled to resolve dependencies
# these steps take approx. 4 minutes to complete on a fresh colab instance
!add-apt-repository -y ppa:gift/stable
!apt update
!apt-get update
!apt install -y plaso-tools

In [4]:
# This notebook was tested with version 20240308 (since 2022 psort.py assumes opensearch and no longer supports elasticsearch)

!psort.py -V

plaso - psort version 20240308


In [3]:
# check if plaso tools were installed by running psort.py

!psort.py -o list



******************************** Output Modules ********************************
         Name : Description
--------------------------------------------------------------------------------
      dynamic : Dynamic selection of fields for a separated value output
                format.
         json : Saves the events into a JSON format.
    json_line : Saves the events into a JSON line format.
          kml : Saves events with geography data into a KML format.
       l2tcsv : CSV format used by legacy log2timeline, with 17 fixed fields.
       l2ttln : Extended TLN 7 field | delimited output.
         null : Output module that does not output anything.
   opensearch : Saves the events into an OpenSearch database.
opensearch_ts : Saves the events into an OpenSearch database for use with
                Timesketch.
        rawpy : native (or "raw") Python output.
          tln : TLN 5 field | delimited output.
         xlsx : Excel Spreadsheet (XLSX) output
----------------------------

## 2 Download and setup the Opensearch instance

There are different ways to install and use Opensearch. Because we are working in a notebook, we will download and install it from a tarball, which can be downloaded from: https://opensearch.org/downloads.html .

In [5]:
!wget -q https://artifacts.opensearch.org/releases/bundle/opensearch/2.15.0/opensearch-2.15.0-linux-x64.tar.gz
!wget -q https://artifacts.opensearch.org/releases/bundle/opensearch/2.15.0/opensearch-2.15.0-linux-x64.tar.gz.sha512

# verify the checksum before unpacking the archive
!shasum -a 512 -c opensearch-2.15.0-linux-x64.tar.gz.sha512
!tar -zxf opensearch-2.15.0-linux-x64.tar.gz

opensearch-2.15.0-linux-x64.tar.gz: OK


In [6]:
!whoami

root


In [7]:
# change the owner of the Opensearch filetree from root to daemon because Opensearch cannot run as root.

!sudo chown -R daemon:daemon opensearch-2.15.0/

Run Opensearch as a daemon process

In [8]:
# start Opensearch from the commandline as user daemon.

!sudo -H -u daemon sh -c  "export OPENSEARCH_INITIAL_ADMIN_PASSWORD='P?ssw0rd1'; opensearch-2.15.0/opensearch-tar-install.sh 1> /content/opensearch-2.15.0/logs/os.log 2> /content/opensearch-2.15.0/logs/os.err &"

In [9]:
# Sleep for a few seconds to let the instance start.
import time
time.sleep(20)
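A fixed sleep can be too short on a slow instance and wastes time on a fast one. As an alternative, the following is a minimal sketch that polls the TCP port until something accepts a connection. The helper name `wait_for_port` and its default timings are assumptions made for illustration. An open port only shows that the process is listening; Opensearch may still need a few more seconds before it answers authenticated requests.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True as soon as a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Hypothetical usage in this notebook (assumes Opensearch listens on localhost:9200):
# if not wait_for_port("localhost", 9200):
#     print("Opensearch did not start, check /content/opensearch-2.15.0/logs/os.err")
```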

Once the instance has been started, grep for opensearch in the process list to confirm that it is running.

In [10]:
# this should look like this (the sample below was recorded with opensearch-2.9.0, so its paths differ):
# daemon     10773       1 99 18:28 ?        00:00:50 /content/opensearch-2.9.0/jdk/bin/java -Xshare:auto -Dopensearch.networkaddress.cache.ttl=60 -Dopensearch.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/opensearch-6691642717485449239 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Dclk.tck=100 -Djdk.attach.allowAttachSelf=true -Djava.security.policy=/content/opensearch-2.9.0/config/opensearch-performance-analyzer/opensearch_security.policy --add-opens=jdk.attach/sun.tools.attach=ALL-UNNAMED -XX:MaxDirectMemorySize=536870912 -Dopensearch.path.home=/content/opensearch-2.9.0 -Dopensearch.path.conf=/content/opensearch-2.9.0/config -Dopensearch.distribution.type=tar -Dopensearch.bundled_jdk=true -cp /content/opensearch-2.9.0/lib/* org.opensearch.bootstrap.OpenSearch
# root       11243     404  0 18:29 ?        00:00:00 /bin/bash -c ps -ef | grep opensearch
# root       11245   11243  0 18:29 ?        00:00:00 grep opensearch

!ps -ef | grep opensearch

daemon     32837       1 99 12:48 ?        00:00:40 /content/opensearch-2.15.0/jdk/bin/java -Xshare:
root       33167     601  0 12:49 ?        00:00:00 /bin/bash -c ps -ef | grep opensearch
root       33170   33167  0 12:49 ?        00:00:00 grep opensearch


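Query the base endpoint at localhost port 9200 to retrieve information about the cluster. Opensearch requires HTTPS, a username and a password. The --insecure option makes curl skip certificate verification, which is needed here because the demo certificate is not signed by a CA that curl trusts.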

In [11]:
# This should look like this:
# {
#   "name" : "c6713b380721",
#   "cluster_name" : "opensearch",
#   "cluster_uuid" : "iSeFMOJ-QrSuWtndad89hw",
#   "version" : {
#     "distribution" : "opensearch",
#     "number" : "2.15.0",
#     "build_type" : "tar",
#     "build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
#     "build_date" : "2024-06-20T03:26:49.193630411Z",
#     "build_snapshot" : false,
#     "lucene_version" : "9.10.0",
#     "minimum_wire_compatibility_version" : "7.10.0",
#     "minimum_index_compatibility_version" : "7.0.0"
#   },
#   "tagline" : "The OpenSearch Project: https://opensearch.org/"
# }

!curl -XGET https://localhost:9200 -u 'admin:P?ssw0rd1' --insecure

{
  "name" : "ad9bb51b5370",
  "cluster_name" : "opensearch",
  "cluster_uuid" : "SA0VbylCSF-mS4sKUGPYNw",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.15.0",
    "build_type" : "tar",
    "build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
    "build_date" : "2024-06-20T03:26:49.193630411Z",
    "build_snapshot" : false,
    "lucene_version" : "9.10.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}


In [12]:
# This should look like this:
# health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
# green  open   .plugins-ml-config           UCML1zUaSdOVW-xt1x7MJA   1   0          1            0      3.9kb          3.9kb
# green  open   .opensearch-observability    WFUYHyPoTESfAK39S_eIgw   1   0          0            0       208b           208b
# yellow open   security-auditlog-2024.08.07 KVXM_wPhT_uCMZg4o8GPQw   1   1          6            0     84.8kb         84.8kb
# green  open   .opendistro_security         LjbhbvqNSnuQ4eDUM8e6Xw   1   0         10            0     78.2kb         78.2kb

!curl -XGET "https://localhost:9200/_cat/indices?v" -u 'admin:P?ssw0rd1' --insecure

health status index                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .opensearch-observability bKN4LKHESiWE1bvXpaSjog   1   0          0            0       208b           208b
green  open   .plugins-ml-config        9XEJk73BSjSmLn7RIm74GA   1   0          1            0      3.8kb          3.8kb
green  open   .opendistro_security      yo4zzSAxR9us-bqeqf3Jwg   1   0         10            0     77.4kb         77.4kb


## 3 Use Log2timeline.py and Psort.py to load data in Opensearch

In [13]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [14]:
# In part 3 (step 3) we stored the mus2019ctf.plaso file in your drive.
#
plaso_file = 'gdrive/MyDrive/mus2019ctf.plaso'
#
# and check if it's there
#
!ls -l $plaso_file

# This should look something like this:
# -rw------- 1 root root 494424064 Aug 10 21:04 gdrive/MyDrive/mus2019ctf.plaso

-rw------- 1 root root 493842432 Aug  5 12:39 gdrive/MyDrive/mus2019ctf.plaso


In [None]:
# If it's not there you can create it by repeating the following steps
#
# The complete mus2019ctf.plaso file is 450MB and takes a while. After you have created it
# it makes sense to store it in your gdrive so you can reuse it:
#
# plaso_file = 'gdrive/MyDrive/mus2019ctf.plaso'
#
# if not, you need to create it with log2timeline.py using the complete filter_windows.yml filter
#
# add a shortcut in your Google Drive to this shared drive https://drive.google.com/drive/folders/1KUlZUl4Sy2JzgbuRW-oHjIGFClY2bl75?usp=sharing
# then mount your Google Drive in this colab (you need to authorize this colab to access your Google Drive)
#
#!pip install pyparsing==3.1.0
#disk_image = "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01"
# filter_windows = "/content/gdrive/MyDrive/Testdata/filter_windows.yml"
# !ls -l  $filter_windows
# !log2timeline.py -f $filter_windows --storage_file mus2019ctf.plaso $disk_image --parsers win7 --status_view none
# !cp mus2019ctf.plaso "/content/gdrive/MyDrive"

Use psort.py to write events to the Opensearch instance that we set up earlier. We can use the opensearch output format.

In [15]:
# Before we do that, let's take a look at the opensearch.mappings file that comes with plaso
# actually there is more in that folder that you may be interested in
#
!ls /usr/share/plaso

filter_no_winsxs.yaml  opensearch.mappings  tag_linux.txt    timeliner.yaml
filter_windows.yaml    presets.yaml	    tag_macos.txt    winevt-rc.db
formatters	       signatures.conf	    tag_windows.txt


In [16]:
# let's take a look at the opensearch.mappings
#
!cat /usr/share/plaso/opensearch.mappings

{
  "properties": {
    "application": {
      "type": "text",
      "fields": {
        "keyword": {"type": "keyword"}}
    },
    "data": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "doc_type": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "event_type": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "exit_status": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "facility": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "file_reference": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "file_size": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "flags": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
    "identifier": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },
 

In [17]:
# How is file_size defined?

!grep -A 3 file_size /usr/share/plaso/opensearch.mappings

    "file_size": {
      "type": "text",
      "fields": {"keyword": {"type": "keyword"}}
    },


In [None]:
# Let's change the mapping for file_size to type long
#
#
#    "file_size": {
#      "type": "long"
#    },

In [18]:
!sudo sed -e '/"file_size": {/,/},/c\"file_size": {\n    "type": "long"\n},' /usr/share/plaso/opensearch.mappings -i
!grep -A 2 file_size /usr/share/plaso/opensearch.mappings


"file_size": {
    "type": "long"
},
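The sed command works, but it depends on the exact layout of the file. As an alternative, here is a minimal sketch that changes the type through Python's json module, assuming the mappings file is valid JSON with a top-level "properties" object as printed above. The helper name `set_field_type` is hypothetical, and the example runs on a small in-memory stand-in rather than the real file.

```python
import json


def set_field_type(mappings, field, new_type):
    """Return a copy of a mappings dict in which `field` has the plain type `new_type`."""
    updated = json.loads(json.dumps(mappings))  # cheap deep copy of JSON-compatible data
    updated["properties"][field] = {"type": new_type}
    return updated


# Small stand-in for the file contents, using the structure printed above.
sample = {
    "properties": {
        "file_size": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "flags": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    }
}
patched = set_field_type(sample, "file_size", "long")
print(json.dumps(patched["properties"]["file_size"]))

# Hypothetical application to the real file (needs write permission on the path):
# path = "/usr/share/plaso/opensearch.mappings"
# with open(path) as f:
#     mappings = json.load(f)
# with open(path, "w") as f:
#     json.dump(set_field_type(mappings, "file_size", "long"), f, indent=2)
```

Working on the parsed structure keeps the other fields untouched and fails loudly if the file is not shaped as expected.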


In [19]:
# run psort.py. It takes about 11-13 minutes to export all rows from the 472MB plaso file to OpenSearch

#
!psort.py -o opensearch --server localhost --port 9200 --opensearch-user admin --opensearch-password 'P?ssw0rd1' --opensearch_mappings /usr/share/plaso/opensearch.mappings --use_ssl --ca_certificates_file_path /content/opensearch-2.15.0/config/root-ca.pem --index_name newmus2019ctf $plaso_file --status_view none

Processing completed.


In [20]:
# Let's take a look again at the indices in our Opensearch instance
# This should look like this:
# health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
# green  open   .plugins-ml-config           UCML1zUaSdOVW-xt1x7MJA   1   0          1            0      3.9kb          3.9kb
# green  open   .opensearch-observability    WFUYHyPoTESfAK39S_eIgw   1   0          0            0       208b           208b
# yellow open   newmus2019ctf                R32IE-4sRkWPgYAJq1GOaA   1   1     805683            0    403.4mb        403.4mb
# yellow open   security-auditlog-2024.08.07 KVXM_wPhT_uCMZg4o8GPQw   1   1         47            0     99.7kb         99.7kb
# green  open   .opendistro_security         LjbhbvqNSnuQ4eDUM8e6Xw   1   0         10            0     78.2kb         78.2kb
#

!curl -X GET "https://localhost:9200/_cat/indices?v" -u 'admin:P?ssw0rd1' --insecure

health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .opensearch-observability    bKN4LKHESiWE1bvXpaSjog   1   0          0            0       208b           208b
green  open   .plugins-ml-config           9XEJk73BSjSmLn7RIm74GA   1   0          1            0      3.9kb          3.9kb
yellow open   security-auditlog-2024.08.20 BsldWG0nTNCvykzsEdGSRQ   1   1         41            0    136.6kb        136.6kb
yellow open   newmus2019ctf                y9potrVMSUWTvBR5VezgbA   1   1     530692            0    392.8mb        392.8mb
green  open   .opendistro_security         yo4zzSAxR9us-bqeqf3Jwg   1   0         10            0     78.2kb         78.2kb


In [21]:
# we can also see what fields were mapped in this index by Psort.py
#
!curl -XGET "https://localhost:9200/newmus2019ctf/_mapping?format=json&pretty" -u 'admin:P?ssw0rd1' --insecure


{
  "newmus2019ctf" : {
    "mappings" : {
      "properties" : {
        "access_count" : {
          "type" : "long"
        },
        "account_rid" : {
          "type" : "long"
        },
        "application" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        },
        "birth_droid_file_identifier" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "birth_droid_volume_identifier" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "build_number" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }

## 4 Accessing Opensearch via the REST API

In [22]:
# we can use the Opensearch search API to get the first 10 results
#
!curl -sX GET "https://localhost:9200/_search?format=json&pretty" -u 'admin:P?ssw0rd1' --insecure

{
  "took" : 2287,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "newmus2019ctf",
        "_id" : "spbab5EBFn9W8mx_ImDy",
        "_score" : 1.0,
        "_source" : {
          "data_type" : "pe_coff:file",
          "export_dll_name" : "boost_date_time-vc100-mt-1_49.dll",
          "imphash" : "cda5d00eb3ae60358030f63e087c91b6",
          "pe_type" : "Dynamic Link Library (DLL)",
          "section_names" : [
            ".text\u0000\u0000\u0000",
            ".rdata\u0000\u0000",
            ".data\u0000\u0000\u0000",
            ".reloc\u0000\u0000"
          ],
          "path_spec" : "{\"__type__\": \"PathSpec\", \"mft_attribute\": 2, \"mft_entry\": 99908, \"location\": \"\\\\Users\\\\Administrator\\\\Desktop\\\\FTK_Imager_Lite_3.1.1\\\\boost_date_time-vc100-mt-

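The response above reports a total of 10000 with relation "gte", which means the count is only a lower bound: by default the hit count is tracked up to 10,000. The sketch below shows how a request body can ask for an exact count with `track_total_hits`, and how to read the total either way. The helper names `exact_count_body` and `describe_total` are hypothetical.

```python
def exact_count_body(query):
    """Wrap a query clause in a search body that requests an exact hit count."""
    return {"query": query, "track_total_hits": True}


def describe_total(total):
    """Turn the hits.total object of a response into a readable string."""
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    return f"at least {total['value']} hits"


print(describe_total({"value": 10000, "relation": "gte"}))
print(exact_count_body({"match_all": {}}))

# Hypothetical usage with the Python client created in section 5:
# response = client.search(index="newmus2019ctf", body=exact_count_body({"match_all": {}}), size=0)
# print(describe_total(response["hits"]["total"]))
```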
In [23]:
# show the settings for the newmus2019ctf index

!curl -sX GET "https://localhost:9200/newmus2019ctf/_settings?format=json&pretty" -u 'admin:P?ssw0rd1' --insecure

{
  "newmus2019ctf" : {
    "settings" : {
      "index" : {
        "replication" : {
          "type" : "DOCUMENT"
        },
        "number_of_shards" : "1",
        "provided_name" : "newmus2019ctf",
        "creation_date" : "1724158446543",
        "number_of_replicas" : "1",
        "uuid" : "y9potrVMSUWTvBR5VezgbA",
        "version" : {
          "created" : "136367827"
        }
      }
    }
  }
}
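The settings show number_of_replicas set to 1, which explains why the index is listed with health "yellow": on a single-node instance there is no second node to hold the replica shard. The following sketch builds the request body that would set the replica count to 0. The helper name `replica_settings` is hypothetical, and the client call is left commented out because it assumes the opensearch-py client from section 5.

```python
def replica_settings(replicas):
    """Build the request body that changes the replica count of an index."""
    return {"index": {"number_of_replicas": replicas}}


body = replica_settings(0)
print(body)

# Hypothetical usage with the opensearch-py client from section 5:
# client.indices.put_settings(index="newmus2019ctf", body=body)
```

For a throwaway notebook instance the yellow status is harmless, so this step is optional.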


## 5 Accessing the Opensearch API in Python

In [24]:
# So far we have been accessing information directly with curl from the Opensearch REST API
# There is also an Opensearch Python API that we can use. See https://opensearch.org/docs/latest/clients/python/
#

from opensearchpy import OpenSearch

host = 'localhost'
port = 9200
auth = ('admin', 'P?ssw0rd1') # For testing only. Don't store credentials in code.
ca_certs_path = '/content/opensearch-2.15.0/config/root-ca.pem' # Provide a CA bundle if you use intermediate CAs with your root CA.

# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
    ca_certs = ca_certs_path
)
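The comment in the cell above warns against storing credentials in code. One common way to follow that advice is to read them from environment variables, sketched below. The helper name `load_auth` and the variable names OPENSEARCH_USER and OPENSEARCH_PASSWORD are assumptions chosen for this example, not names that Opensearch or opensearch-py define.

```python
import os


def load_auth(env=os.environ):
    """Read (user, password) from environment variables instead of hard-coding them."""
    user = env.get("OPENSEARCH_USER", "admin")
    password = env.get("OPENSEARCH_PASSWORD")
    if not password:
        raise RuntimeError("set OPENSEARCH_PASSWORD before creating the client")
    return (user, password)


# Demonstration with an explicit mapping so the example runs anywhere.
print(load_auth({"OPENSEARCH_USER": "admin", "OPENSEARCH_PASSWORD": "example"}))

# In the notebook this could replace the literal tuple:
# auth = load_auth()
```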


Which indexes are available?

In [25]:
client.indices.get_alias(index="*")

{'.opensearch-observability': {'aliases': {}},
 '.plugins-ml-config': {'aliases': {}},
 'security-auditlog-2024.08.20': {'aliases': {}},
 '.opensearch-sap-log-types-config': {'aliases': {}},
 'newmus2019ctf': {'aliases': {}},
 '.opendistro_security': {'aliases': {}}}

Search the index with a full-text query and get the first 5 results

In [26]:

response = client.search(index="newmus2019ctf", body={"query": {"match": {"message": { "query": "selmabouvier"  }}}}, size=5)
docs = response['hits']['hits']
docs

[{'_index': 'newmus2019ctf',
  '_id': 'NZveb5EBFn9W8mx_zc19',
  '_score': 7.814361,
  '_source': {'data_type': 'fs:stat',
   'display_name': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop.ini',
   'file_entry_type': 'file',
   'file_size': 282,
   'file_system_type': 'NTFS',
   'filename': '\\Users\\SelmaBouvier\\Desktop\\desktop.ini',
   'inode': '1125899906949326',
   'is_allocated': True,
   'path_spec': '{"__type__": "PathSpec", "mft_attribute": 1, "mft_entry": 106702, "location": "\\\\Users\\\\SelmaBouvier\\\\Desktop\\\\desktop.ini", "parent": {"__type__": "PathSpec", "part_index": 2, "location": "/p1", "start_offset": 576716800, "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}',
   'sha256_hash': '4b9d687ac625690fd026ed4b236dad1cac90ef69e7ad256cc42766a065b50026',
   'da

In [27]:
# The '_source' property has the psort values
#
for num, doc in enumerate(docs):
  print(num, '-->', doc['_source'], "\n")

0 --> {'data_type': 'fs:stat', 'display_name': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop.ini', 'file_entry_type': 'file', 'file_size': 282, 'file_system_type': 'NTFS', 'filename': '\\Users\\SelmaBouvier\\Desktop\\desktop.ini', 'inode': '1125899906949326', 'is_allocated': True, 'path_spec': '{"__type__": "PathSpec", "mft_attribute": 1, "mft_entry": 106702, "location": "\\\\Users\\\\SelmaBouvier\\\\Desktop\\\\desktop.ini", "parent": {"__type__": "PathSpec", "part_index": 2, "location": "/p1", "start_offset": 576716800, "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}', 'sha256_hash': '4b9d687ac625690fd026ed4b236dad1cac90ef69e7ad256cc42766a065b50026', 'datetime': '2019-02-25T20:47:01.374840+00:00', 'message': 'NTFS:\\Users\\SelmaBouvier\\Desktop\\desktop.ini Type: file', 's

In [28]:
# We define a Python function to list results
#
def print_results(response):
  for num, doc in enumerate(response['hits']['hits']):
    print(num, '-->', doc['_source'])

def print_results_detailed(response):
  for num, doc in enumerate(response['hits']['hits']):
    print('\n---------------------------------------------------------------------------------------------------\nresult number: ',num)
    for key, val in doc['_source'].items():
      print(key, val)

In [29]:
# we can try this function on the response we got earlier
#
print_results_detailed(response)


---------------------------------------------------------------------------------------------------
result number:  0
data_type fs:stat
display_name NTFS:\Users\SelmaBouvier\Desktop\desktop.ini
file_entry_type file
file_size 282
file_system_type NTFS
filename \Users\SelmaBouvier\Desktop\desktop.ini
inode 1125899906949326
is_allocated True
path_spec {"__type__": "PathSpec", "mft_attribute": 1, "mft_entry": 106702, "location": "\\Users\\SelmaBouvier\\Desktop\\desktop.ini", "parent": {"__type__": "PathSpec", "part_index": 2, "location": "/p1", "start_offset": 576716800, "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}
sha256_hash 4b9d687ac625690fd026ed4b236dad1cac90ef69e7ad256cc42766a065b50026
datetime 2019-02-25T20:47:01.374840+00:00
message NTFS:\Users\SelmaBouvier\Desktop\de

In [34]:
# Opensearch query syntax is quite elaborate. We will provide some examples in this colab
# notice that results can have different fields. It depends on the data_type
# For a complete overview see the Opensearch reference documents
#
# https://opensearch.org/docs/latest/opensearch/query-dsl/index/
#
#query = '{"query": { "query_string": {"query": "source_short: WEBHIST"  }}}'
#query = '{"query": { "query_string": {"query": "data_type: windows*link"  }}}'
#query = '{"query": { "query_string": {"query": "drive_type: 3"  }}}'
#query = '{"query": { "query_string": {"query": "drive_type: 3 AND data_type: windows*link" }}}'
#query = '{"query": { "range": {"drive_type": { "gte":0 , "lte":2 } }}}'
#query = '{"query": { "range": {"drive_type": { "gte":1 , "lte":3 } }}}'
#query = '{"query": { "query_string": {"query": "drive_type:>=0 AND drive_type:<2"  }}}'
#query = '{"query": { "query_string": {"query": "SelmaBouvier"  }}}'
#query = '{"query": { "query_string": {"query": "file_size:>100000 "  }}}'
#query = '{"query": { "query_string": {"query": "file_size:<29696"  }}}'
#query = '{"query": { "range": {"file_size": { "gte":10000, "lte":100000 }  }}}'
#query = '{"query": { "query_string": {"query": "data_type: msie\\\\:*"  }}}'
query = '{"query": { "query_string": {"query": "www.msftconnecttest.com"  }}}'


response = client.search(index="newmus2019ctf", body=query, size=15)

print_results_detailed(response)



---------------------------------------------------------------------------------------------------
result number:  0
data_type windows:registry:key_value
key_path HKEY_LOCAL_MACHINE\System\ControlSet001\Services\NlaSvc\Parameters\Internet
values [['ActiveDnsProbeContent', 'REG_SZ', '131.107.255.255'], ['ActiveDnsProbeContentV6', 'REG_SZ', 'fd3e:4f5a:5b81::1'], ['ActiveDnsProbeHost', 'REG_SZ', 'dns.msftncsi.com'], ['ActiveDnsProbeHostV6', 'REG_SZ', 'dns.msftncsi.com'], ['ActiveWebProbeContent', 'REG_SZ', 'Microsoft Connect Test'], ['ActiveWebProbeContentV6', 'REG_SZ', 'Microsoft Connect Test'], ['ActiveWebProbeHost', 'REG_SZ', 'www.msftconnecttest.com'], ['ActiveWebProbeHostV6', 'REG_SZ', 'ipv6.msftconnecttest.com'], ['ActiveWebProbePath', 'REG_SZ', 'connecttest.txt'], ['ActiveWebProbePathV6', 'REG_SZ', 'connecttest.txt'], ['EnableActiveProbing', 'REG_DWORD_LE', '1'], ['PassivePollPeriod', 'REG_DWORD_LE', '15'], ['StaleThreshold', 'REG_DWORD_LE', '30'], ['WebTimeout', 'REG_DWORD_LE', '
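The commented examples above use either a query_string or a range clause on its own. The two can be combined in a bool query, with the text search under "must" and the range under "filter". The sketch below builds such a body as a Python dict, which `client.search` accepts just like a JSON string. The helper name `text_and_range` is hypothetical.

```python
import json


def text_and_range(text, field, gte, lte):
    """Combine a query_string clause with a range filter in one bool query."""
    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": text}}],
                "filter": [{"range": {field: {"gte": gte, "lte": lte}}}],
            }
        }
    }


query = text_and_range("SelmaBouvier", "file_size", 10000, 100000)
print(json.dumps(query, indent=2))

# Usage with the client and helper defined earlier in this notebook:
# response = client.search(index="newmus2019ctf", body=query, size=15)
# print_results_detailed(response)
```

Clauses under "filter" restrict the result set without contributing to the relevance score, which suits yes/no conditions such as a size range.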

## 6 Opensearch field aggregation

In [35]:
# First we define some helper functions:

def print_facets(agg_dict):
  for field, val in agg_dict:
      # reset the counter for every field so that counts from earlier fields are not carried over
      total = 0
      print("facets of field ", field,':')
      for bucket in val['buckets']:
        for key in bucket:
          if key=='key':
            print('\t',bucket[key],end='=')
          else:
            print(bucket[key],end='')
            total = total + bucket[key]
        print()
      print("total number of hits for ",field," is ",total)

def print_hit_stats(response):
  print('hit stats:')
  for key, val in response['hits'].items():
      print(key, val)
  print('\n')


In [36]:
querystring = '{ "query_string": {"query": "source_short: WEBHIST"  }}'

query = '{"query": %s}' % querystring

print(query)


{"query": { "query_string": {"query": "source_short: WEBHIST"  }}}


In [37]:
# Aggregating results is one of the most powerful options in Opensearch
#
# https://opensearch.org/docs/latest/opensearch/aggregations/
#
querystring = '{ "query_string": {"query": "SelmaBouvier"  }}'
facets = '"aggs": { "data_type": { "terms": { "field": "data_type.keyword"}}}'
query = '{"query": %s,%s}' % (querystring,facets)

print(query)


{"query": { "query_string": {"query": "SelmaBouvier"  }},"aggs": { "data_type": { "terms": { "field": "data_type.keyword"}}}}
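Assembling JSON with string formatting is easy to get wrong, for example by dropping a comma or a brace. As a sketch of an alternative, the same request can be written as a Python dict. The check below parses the formatted string from the cell above and compares it with the dict form.

```python
import json

# The string-formatting approach used in the cell above.
querystring = '{ "query_string": {"query": "SelmaBouvier"  }}'
facets = '"aggs": { "data_type": { "terms": { "field": "data_type.keyword"}}}'
string_query = '{"query": %s,%s}' % (querystring, facets)

# The same request expressed as a Python dict.
dict_query = {
    "query": {"query_string": {"query": "SelmaBouvier"}},
    "aggs": {"data_type": {"terms": {"field": "data_type.keyword"}}},
}

# Both forms describe the same request once the string is parsed as JSON.
print(json.loads(string_query) == dict_query)
```

Since `client.search` accepts a dict for `body`, as in the earlier match query, either form can be passed to it.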


In [38]:
# with size=0 the search returns no documents, but we still get the total number of hits (=results) and the facet counts

response = client.search(index="newmus2019ctf", body=query, size=0)
print_hit_stats(response)

print_facets(response['aggregations'].items())
# print_results(response)

hit stats:
total {'value': 3879, 'relation': 'eq'}
max_score None
hits []


facets of field  data_type :
	 windows:evtx:record=2698
	 msie:webcache:container=522
	 fs:stat=220
	 windows:prefetch:execution=172
	 windows:lnk:link=53
	 windows:registry:key_value=44
	 msie:webcache:cookie=40
	 msie:webcache:containers=31
	 windows:registry:appcompatcache=27
	 windows:distributed_link_tracking:creation=25
total number of hits for  data_type  is  3832
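The bucket counts add up to 3832 while the query reports 3879 hits. The most likely reason is that a terms aggregation returns only the 10 largest buckets by default; the documents that fall into the omitted buckets are reported in the `sum_other_doc_count` field of the aggregation result. The sketch below reads both numbers from a hand-made sample. The helper name `bucket_coverage` is hypothetical, and the `sum_other_doc_count` value in the sample is made up.

```python
def bucket_coverage(agg):
    """Return (listed, other): docs in the returned buckets and docs in omitted buckets."""
    listed = sum(bucket["doc_count"] for bucket in agg["buckets"])
    other = agg.get("sum_other_doc_count", 0)
    return listed, other


# Hand-made sample shaped like a terms aggregation result. The bucket counts are the
# first three from the output above; the sum_other_doc_count value is made up.
sample = {
    "sum_other_doc_count": 5,
    "buckets": [
        {"key": "windows:evtx:record", "doc_count": 2698},
        {"key": "msie:webcache:container", "doc_count": 522},
        {"key": "fs:stat", "doc_count": 220},
    ],
}
listed, other = bucket_coverage(sample)
print(listed, other)

# With a real response:
# listed, other = bucket_coverage(response["aggregations"]["data_type"])
```

Adding a "size" value inside the "terms" object requests more than 10 buckets.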


In [39]:
# Aggregate across multiple facets
#
querystring = '{ "query_string": {"query": "SelmaBouvier"  }}'

facets = '"aggs": { "source_long": { "terms": { "field": "source_long.keyword"}},  "data_type": { "terms": { "field": "data_type.keyword"}}}'
query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=3)

print_facets(response['aggregations'].items())


facets of field  source_long :
	 WinEVTX=2698
	 MSIE WebCache container record=522
	 File stat=220
	 WinPrefetch=172
	 Windows Shortcut=53
	 Registry Key=44
	 MSIE WebCache cookies record=40
	 MSIE WebCache containers record=31
	 AppCompatCache Registry Key=27
	 System=25
total number of hits for  source_long  is  3832
facets of field  data_type :
	 windows:evtx:record=2698
	 msie:webcache:container=522
	 fs:stat=220
	 windows:prefetch:execution=172
	 windows:lnk:link=53
	 windows:registry:key_value=44
	 msie:webcache:cookie=40
	 msie:webcache:containers=31
	 windows:registry:appcompatcache=27
	 windows:distributed_link_tracking:creation=25
total number of hits for  data_type  is  3832


In [40]:
# aggregate across datetime
#
# also see pipeline aggregations https://opensearch.org/docs/latest/opensearch/pipeline-agg/

querystring = '{ "query_string": {"query": "\\\\<\\\\/Event\\\\>"  }}'

facets = '"aggs": { "datetime": { "date_histogram": { "field": "datetime", "calendar_interval": "year"}}}'

query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=0)

response

{'took': 927,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': None,
  'hits': []},
 'aggregations': {'datetime': {'buckets': [{'key_as_string': '1970-01-01T00:00:00.000Z',
     'key': 0,
     'doc_count': 2},
    {'key_as_string': '1971-01-01T00:00:00.000Z',
     'key': 31536000000,
     'doc_count': 0},
    {'key_as_string': '1972-01-01T00:00:00.000Z',
     'key': 63072000000,
     'doc_count': 0},
    {'key_as_string': '1973-01-01T00:00:00.000Z',
     'key': 94694400000,
     'doc_count': 0},
    {'key_as_string': '1974-01-01T00:00:00.000Z',
     'key': 126230400000,
     'doc_count': 0},
    {'key_as_string': '1975-01-01T00:00:00.000Z',
     'key': 157766400000,
     'doc_count': 0},
    {'key_as_string': '1976-01-01T00:00:00.000Z',
     'key': 189302400000,
     'doc_count': 0},
    {'key_as_string': '1977-01-01T00:00:00.000Z',
     'key': 220924800000,
     'doc_count'
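The response above contains a bucket for every year since 1970, most of them empty. Two ways to suppress the empty ones are sketched here: setting `min_doc_count` in the date_histogram so that they are not returned at all, or filtering them out afterwards in Python. The helper names `yearly_histogram` and `non_empty` are hypothetical.

```python
def yearly_histogram(field, min_doc_count=1):
    """Build a date_histogram aggregation that omits buckets below min_doc_count."""
    return {
        field: {
            "date_histogram": {
                "field": field,
                "calendar_interval": "year",
                "min_doc_count": min_doc_count,
            }
        }
    }


def non_empty(buckets):
    """Client-side alternative: drop buckets with a doc_count of 0."""
    return [b for b in buckets if b["doc_count"] > 0]


# The first three buckets from the response above.
buckets = [
    {"key_as_string": "1970-01-01T00:00:00.000Z", "key": 0, "doc_count": 2},
    {"key_as_string": "1971-01-01T00:00:00.000Z", "key": 31536000000, "doc_count": 0},
    {"key_as_string": "1972-01-01T00:00:00.000Z", "key": 63072000000, "doc_count": 0},
]
print(non_empty(buckets))
print(yearly_histogram("datetime"))

# Hypothetical usage:
# body = {"query": {"match_all": {}}, "aggs": yearly_histogram("datetime")}
# response = client.search(index="newmus2019ctf", body=body, size=0)
```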

In [41]:
# aggregate across file_size (this requires that the type was changed from text to long in the mappings file before running psort.py)
#
# also see pipeline aggregations https://opensearch.org/docs/latest/opensearch/pipeline-agg/

querystring = '{ "query_string": {"query": "file_size:<100000000"  }}'

facets = '"aggs": { "file_size": { "histogram": { "field": "file_size", "interval": 10000000}}}'

query = '{"query": %s,%s}' % (querystring,facets)

response = client.search(index="newmus2019ctf", body=query, size=0)

print_facets(response['aggregations'].items())

facets of field  file_size :
	 0.0=3864
	 10000000.0=22
	 20000000.0=31
	 30000000.0=4
	 40000000.0=12
	 50000000.0=0
	 60000000.0=0
	 70000000.0=4
total number of hits for  file_size  is  3937


In [42]:
# date range search

query = '{"query": { "query_string": {"query": "datetime:[2019-03-12 TO 2019-03-22]"  }}}'

print(query)
response = client.search(index="newmus2019ctf", body=query, size=10)
print_results_detailed(response)


{"query": { "query_string": {"query": "datetime:[2019-03-12 TO 2019-03-22]"  }}}

---------------------------------------------------------------------------------------------------
result number:  0
data_type windows:evtx:record
computer_name DESKTOP-0QT8017
event_identifier 100
event_level 4
event_version 0
message_identifier 100
offset 0
provider_identifier {6bba3851-2c7e-4dea-8f54-31e5afd029e3}
record_number 4679
recovered False
source_name Microsoft-Windows-Diagnosis-DPS
strings ['{180B3A99-8C39-4F12-B631-2031998EFE45}', '{5AE2C742-1D4A-4568-A41A-73B87D7A808B}', '{00000000-0000-0000-0000-000000000000}', '%windir%\\system32\\radardt.dll', '{45DE1EA9-10BC-4F96-9B21-4B6B83DBF476}']
user_sid S-1-5-19
xml_string <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Diagnosis-DPS" Guid="{6BBA3851-2C7E-4DEA-8F54-31E5AFD029E3}"/>
    <EventID>100</EventID>
    <Version>0</Version>
    <Level>4</Level>
    <Task>1</Task>
    <

## 7 Putting Opensearch JSON output into a Pandas dataframe

In [43]:
# The output is json format which we can store in a pandas dataframe
import pandas as pd
import json
from io import StringIO

output = !curl -sX GET "https://localhost:9200/_search?q=logon" -u 'admin:P?ssw0rd1' --insecure
df = pd.read_json(StringIO(output[0]))


In [48]:
df

Unnamed: 0,took,timed_out,_shards,hits
total,239,False,5.0,"{'value': 431, 'relation': 'eq'}"
successful,239,False,5.0,
skipped,239,False,0.0,
failed,239,False,0.0,
max_score,239,False,,17.892853
hits,239,False,,"[{'_index': 'newmus2019ctf', '_id': 'dpncb5EBF..."


In [49]:
df['hits']['hits'][:1]

[{'_index': 'newmus2019ctf',
  '_id': 'dpncb5EBFn9W8mx_489a',
  '_score': 17.892853,
  '_source': {'data_type': 'windows:registry:key_value',
   'key_path': 'HKEY_LOCAL_MACHINE\\System\\ControlSet001\\Control\\Terminal Server\\Utilities\\change',
   'values': [['logon', 'REG_MULTI_SZ', '[0, 1, LOGON, chglogon.exe]'],
    ['port', 'REG_MULTI_SZ', '[0, 1, PORT, chgport.exe]'],
    ['user', 'REG_MULTI_SZ', '[0, 1, USER, chgusr.exe]'],
    ['winsta', 'REG_MULTI_SZ', '[1, WINSTA, chglogon.exe]']],
   'path_spec': '{"__type__": "PathSpec", "mft_attribute": 1, "mft_entry": 83339, "location": "\\\\Windows\\\\System32\\\\config\\\\SYSTEM", "parent": {"__type__": "PathSpec", "part_index": 2, "location": "/p1", "start_offset": 576716800, "parent": {"__type__": "PathSpec", "parent": {"__type__": "PathSpec", "location": "/content/gdrive/MyDrive/Images/Windows/MUS-CTF-19-DESKTOP-001.E01", "type_indicator": "OS"}, "type_indicator": "EWF"}, "type_indicator": "TSK_PARTITION"}, "type_indicator": "NTFS"}
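Reading the whole response with read_json yields a dataframe whose columns are the top-level keys of the response, with the actual events nested inside the hits column. For analysis it is usually more convenient to have one row per event. The following sketch flattens the hits into a list of plain dicts in pure Python. The helper name `hits_to_rows` is hypothetical, and the sample is a trimmed-down version of the hit shown above.

```python
def hits_to_rows(response, fields):
    """Return one flat dict per hit, holding _id, _score and the requested _source fields."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_id": hit["_id"], "_score": hit["_score"]}
        for field in fields:
            row[field] = hit["_source"].get(field)  # None when the event lacks the field
        rows.append(row)
    return rows


# Trimmed-down response based on the hit shown above.
sample = {
    "hits": {
        "hits": [
            {
                "_index": "newmus2019ctf",
                "_id": "dpncb5EBFn9W8mx_489a",
                "_score": 17.892853,
                "_source": {
                    "data_type": "windows:registry:key_value",
                    "key_path": "HKEY_LOCAL_MACHINE\\System\\ControlSet001\\Control\\Terminal Server\\Utilities\\change",
                },
            }
        ]
    }
}
rows = hits_to_rows(sample, ["data_type", "key_path", "file_size"])
print(rows)

# With pandas imported as pd, the rows convert directly into a dataframe:
# df_hits = pd.DataFrame(rows)
```

Selecting the fields explicitly keeps the columns predictable, since events of different data types carry different fields.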

# Exercises

## 1 Use Opensearch to filter events between 2019-03-12 and 2019-03-22

In [None]:
# Your answer

## 2 Write a query that performs an aggregation on source_long and source_short (can you find the right field names?)

In [None]:
# Your answer

## 3 Combine your date range filter from exercise 1 with the facet aggregation from exercise 2

In [None]:
# Your answer

## 4 ***Advanced*** Use opensearch facet aggregation to create a treemap visualisation of a filtered set of events in the index.

#### Step 1
The source_short and source_long fields look interesting for visualisation. Let's focus on REG, LOG and FILE and run a query. Try the following to see what this looks like:

In [None]:
# Your answer

Why is this not very helpful for a treemap visualisation?

In [None]:
# Your answer

#### Step 2
One approach is to aggregate only across source_long and run 3 separate queries for source_short equal to REG, LOG and FILE respectively, and then combine the results into a single dataframe that we can visualise:

In [None]:
# Your answer. If you are struggling you can skip this step and move on to step 3.

#### Step 3
Opensearch can also aggregate across multiple fields combined. This is called a multi_terms aggregation, which aggregates across all (source_long, source_short) value pairs. Note: by default Opensearch returns at most 10 buckets. We set it to 20 here using the size parameter.

See https://opensearch.org/docs/2.0/opensearch/bucket-agg/ for more information about Opensearch and the multi_terms aggregation.

In [None]:
# Your answer

#### Step 4
The exercise asked for aggregation across 3 fields, so let's add the parser field and deepen our treemap visualisation.

In [None]:
# Your answer