# Elasticsearch & Warden
Tests for searching data across multiple indices.



Let's connect to our Elasticsearch instance and test the connection.  
Run this cell before using the notebook.


In [3]:
from elasticsearch import Elasticsearch

ES = Elasticsearch()

print('Testing connection...')
if ES.ping():
    print('Success!')
else:
    print('No connection...')


Testing connection...
Success!
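
If the cluster is still starting when the notebook is opened, a single `ping()` can fail even though the instance is about to come up. The sketch below retries the ping a few times. `wait_for_cluster` and `FakeClient` are hypothetical names introduced for illustration; the only client method assumed is `ping()`, which the cell above already uses.

```python
import time


def wait_for_cluster(client, attempts=5, delay=1.0):
    """Return True as soon as client.ping() succeeds, False after `attempts` tries."""
    for attempt in range(attempts):
        if client.ping():
            return True
        # Pause between tries, but not after the last one
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


class FakeClient:
    """Stand-in for an Elasticsearch client; its ping succeeds on the third call."""

    def __init__(self):
        self.calls = 0

    def ping(self):
        self.calls += 1
        return self.calls >= 3


fake = FakeClient()
print(wait_for_cluster(fake, attempts=5, delay=0))
```

With the real client, the call would presumably be `wait_for_cluster(ES)`.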


## Creating an index
Let's start by defining a mapping.  
Mappings: https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping.html#create-mapping  
Data types: https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-types.html  
  
If you wish to modify the mapping, you can use the put_mapping cell below. No need to create a new index.

In [4]:
from mapping import mappings

Next, we define the settings of our index, which control how many resources a single index is allowed to use.  
For now, we allow a single shard per index. The number of shards cannot be changed once the index has been created. If we ever need to expand capacity, we can clone existing indices.  
New indices can be created with different capacities.  
Settings: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-modules-settings

In [5]:
settings = {"number_of_shards": 1,
            "max_inner_result_window": 10000}

We now have all we need to create an index. We'll put the name of the org and the time of creation in the index name.  
We'll create three different indices for three different time periods.  
Creating an index: https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-create-index.html 

In [6]:
import datetime

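# WARNING: this asks Elasticsearch to delete every index in the cluster.
# Only run it against a throwaway test instance.
ES.indices.delete(index='_all')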

date1 = int(datetime.datetime(2020,4,6).timestamp())*1000
date2 = int(datetime.datetime(2020,4,16).timestamp())*1000
date3 = int(datetime.datetime(2020,4,25).timestamp())*1000

org_id = 'mondata-mic'
index_name1 = f'{org_id}-{date1}'
index_name2 = f'{org_id}-{date2}'
index_name3 = f'{org_id}-{date3}'
index_names = [index_name1, index_name2, index_name3]
request_body = {'mappings': mappings, 'settings': settings}


for index_name in index_names:
    print('Attempting to create indices')
    response = ES.indices.create(index=index_name, body=request_body)
    print('Response: ', response)

Attempting to create indices
Response:  {'acknowledged': True, 'shards_acknowledged': True, 'index': 'mondata-mic-1586145600000'}
Attempting to create indices
Response:  {'acknowledged': True, 'shards_acknowledged': True, 'index': 'mondata-mic-1587009600000'}
Attempting to create indices
Response:  {'acknowledged': True, 'shards_acknowledged': True, 'index': 'mondata-mic-1587787200000'}
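
The timestamps in the index names above come from naive `datetime` objects, so they depend on the timezone of the machine running the notebook (the recorded values correspond to local midnight at UTC-4). A minimal sketch of a timezone-independent variant, assuming UTC midnight is an acceptable convention; `to_epoch_millis` and `build_index_name` are hypothetical helpers, not part of the notebook's modules.

```python
import datetime


def to_epoch_millis(year, month, day):
    """Epoch milliseconds for midnight UTC on the given date."""
    moment = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(moment.timestamp()) * 1000


def build_index_name(org_id, year, month, day):
    """Index name in the '<org_id>-<creation millis>' format used above."""
    return f'{org_id}-{to_epoch_millis(year, month, day)}'


print(build_index_name('mondata-mic', 2020, 4, 6))
# prints mondata-mic-1586131200000
```

Names built this way are the same on every machine, which matters once several hosts create or filter indices by date.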


## Searching through different indices by date

Let's get all of the organisation's indices.

In [7]:
indices = ES.indices.get(index=f'{org_id}-*')
print('All organisation indices')
for index_name in indices:
    print(index_name)

All organisation indices
mondata-mic-1586145600000
mondata-mic-1587009600000
mondata-mic-1587787200000


Creation dates of the current indices:

2020/04/06

2020/04/16

2020/04/25

We can filter indices to search between two dates...

In [8]:
from date_indices import get_index_date, filter_indices

date_from = int(datetime.datetime(2020,4,7).timestamp())*1000
date_to = int(datetime.datetime(2020,4,15).timestamp())*1000
new_indices = filter_indices(indices, date_from=date_from, date_to=date_to)
for index in new_indices:
    print(f'index name: {index}, creation date: {datetime.datetime.fromtimestamp(get_index_date(index)/1000)}')


index name: mondata-mic-1586145600000, creation date: 2020-04-06 00:00:00


before a date...

In [9]:
date_to = int(datetime.datetime(2020,4,17).timestamp())*1000
new_indices = filter_indices(indices, date_to=date_to)
for index in new_indices:
    print(f'index name: {index}, creation date: {datetime.datetime.fromtimestamp(get_index_date(index)/1000)}')

index name: mondata-mic-1586145600000, creation date: 2020-04-06 00:00:00
index name: mondata-mic-1587009600000, creation date: 2020-04-16 00:00:00


or after a date...

In [10]:
date_from = int(datetime.datetime(2020,4,17).timestamp())*1000
new_indices = filter_indices(indices, date_from=date_from)
for index in new_indices:
    print(f'index name: {index}, creation date: {datetime.datetime.fromtimestamp(get_index_date(index)/1000)}')
    

index name: mondata-mic-1587787200000, creation date: 2020-04-25 00:00:00
index name: mondata-mic-1587009600000, creation date: 2020-04-16 00:00:00
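
The `date_indices` module is not shown in this notebook. The outputs above are consistent with each index covering the period from its own creation date up to the creation of the next index, and with the filter keeping every index whose period overlaps the requested range. The following is a hypothetical re-implementation under that assumption, not the module's actual code; it returns the indices oldest first, which may differ from the order printed above.

```python
def get_index_date(index_name):
    """Creation date in epoch milliseconds, read from the index name suffix."""
    return int(index_name.rsplit('-', 1)[1])


def filter_indices(indices, date_from=None, date_to=None):
    """Keep the indices whose covered period overlaps [date_from, date_to]."""
    ordered = sorted(indices, key=get_index_date)
    selected = []
    for position, name in enumerate(ordered):
        start = get_index_date(name)
        # An index covers the time up to the creation of the next one;
        # the newest index is open-ended.
        if position + 1 < len(ordered):
            end = get_index_date(ordered[position + 1])
        else:
            end = float('inf')
        if date_to is not None and start > date_to:
            continue  # created after the requested period
        if date_from is not None and end <= date_from:
            continue  # closed before the requested period
        selected.append(name)
    return selected


names = ['mondata-mic-1586145600000',
         'mondata-mic-1587009600000',
         'mondata-mic-1587787200000']

# 2020/04/07 to 2020/04/15 as epoch milliseconds in the notebook's timezone
print(filter_indices(names, date_from=1586232000000, date_to=1586923200000))
```

This explains why a search from 2020/04/07 to 2020/04/15 selects the index created on 2020/04/06: that index holds everything written until the next index was created on 2020/04/16.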


Now that we can filter indices by date, we can also query their content.  
Let's start by adding some content to our indices.

In [11]:
import random
from state import State

# In the first index
for _ in range(10):
    timestamp = int(date1 + random.random()*(date2 - date1))
    state = State(timestamp=timestamp)
    ES.index(index=index_name1, body=state.json())

# In the second index
for _ in range(10):
    timestamp = int(date2 + random.random()*(date3 - date2))
    state = State(timestamp=timestamp)
    ES.index(index=index_name2, body=state.json())
    
# In the third index
for _ in range(10):
    timestamp = int(date3 + random.random()*(date3 - date2))
    state = State(timestamp=timestamp)
    ES.index(index=index_name3, body=state.json())



Our data should now be in the indices. We can query the whole content of an index.


In [12]:
import json

data = ES.search(index=index_name1, body={"query": {"match_all": {}}, "size": 10000})
# The context manager closes the file even if the write fails
with open("query_all_result.txt", "w") as f:
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ':')))

Let's query the number of sensitive files per user over the period 2020/04/07 to 2020/04/15.

To do this, we must first determine the right indices according to the dates.

In [13]:
from state import EndpointEnum

endpoint = EndpointEnum[0]['name']
indices = ES.indices.get(index=f'{org_id}-*')
date_from = int(datetime.datetime(2020,4,7).timestamp())*1000
date_to = int(datetime.datetime(2020,4,15).timestamp())*1000
search_indices = filter_indices(indices, date_from=date_from, date_to=date_to)

query = {
    'query': {
        'match_all': {}
    },
    'aggregations': {
        'users' : {
            'terms': {'field': 'endpoint_id'},
            'aggregations': {
                'pii_files':{
                    'nested': {
                        'path': 'pii_files'
                    },
                    'aggregations':{
                        'pii_files_number': {
                            'terms' : {
                                'field': 'pii_files.timestamp'
                            }
                        }
                    }
                }
            }
        }
    },
    'size': 0
}

res = ES.search(index=search_indices, body=query)
# The context manager closes the file even if the write fails
with open("aggs_files_timeseries.txt", "w") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
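
The response to the query above nests one `pii_files` result inside each `users` bucket. A minimal sketch of reading it back into a per-user count follows. `pii_files_per_user` is a hypothetical helper, and `sample` is a hand-written dictionary with the general shape of a terms plus nested aggregation response; its keys and counts are illustrative, not output from this notebook.

```python
def pii_files_per_user(response):
    """Map each endpoint_id bucket key to its nested pii_files doc_count."""
    buckets = response['aggregations']['users']['buckets']
    return {bucket['key']: bucket['pii_files']['doc_count'] for bucket in buckets}


# Illustrative response shape only; values are made up for the example.
sample = {
    'aggregations': {
        'users': {
            'buckets': [
                {'key': 'endpoint-a', 'doc_count': 4,
                 'pii_files': {'doc_count': 9, 'pii_files_number': {'buckets': []}}},
                {'key': 'endpoint-b', 'doc_count': 2,
                 'pii_files': {'doc_count': 3, 'pii_files_number': {'buckets': []}}},
            ]
        }
    }
}

print(pii_files_per_user(sample))
```

Note that the nested `doc_count` adds up the `pii_files` entries of every matching state, so a file reported in several states is counted once per state.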

In [14]:
query = {
    'query': {
        'bool': {
            'filter': [
                { "term":  { "endpoint_id": endpoint }},
            ]
        }
    },
    'aggregations': {
        'pii_files':{
            'nested': {
                'path': 'pii_files'
            },
            'aggregations':{
                'latest': {
                    'terms' : {
                        'field': 'pii_files.timestamp',
                        'order': {'_key': 'desc'},
                        'size': 1,
                    },
                    "aggregations": {
                        "latest_state": {
                            "top_hits": {
                                "from": 0,
                                "size": 10000
                            }
                        }
                    }
                }
            }
        }
    },
    'size': 0
}

res = ES.search(index=search_indices, body=query)
# The context manager closes the file even if the write fails
with open("latest_state.txt", "w") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
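
Both queries above pick the indices by date but then match every document inside them, so states written outside the requested period can still be counted. One possible refinement is to add a range filter on the document timestamp. The sketch assumes the states carry a top-level `timestamp` field in epoch milliseconds, which the mapping (not shown here) would need to confirm; `with_time_range` is a hypothetical helper.

```python
def with_time_range(query, date_from, date_to, field='timestamp'):
    """Return a copy of `query` limited to documents with date_from <= field <= date_to."""
    time_filter = {'range': {field: {'gte': date_from, 'lte': date_to}}}
    original = query.get('query', {'match_all': {}})
    bounded = dict(query)  # shallow copy, the caller's dict is left untouched
    bounded['query'] = {'bool': {'must': [original], 'filter': [time_filter]}}
    return bounded


base = {'query': {'match_all': {}}, 'size': 0}
bounded = with_time_range(base, 1586232000000, 1586923200000)
print(bounded['query'])
```

The result could then be passed as `body=` to `ES.search` in place of the original query.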

## Using aliases to point to different indices
Let's create an alias that links to all of the organisation's indices.  
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html

In [16]:
org_alias = 'myOrgAlias'
if ES.indices.exists_alias(name=org_alias):
    ES.indices.delete_alias(index='*', name=org_alias)
ES.indices.update_aliases({
    'actions': [
        {'add': {'indices': index_names, 'alias': org_alias}},
        {'add': {'index': index_name3, 'alias': org_alias, 'is_write_index':True}}
    ]
})

{'acknowledged': True}

See what we can get from this alias!

In [17]:
ES.indices.get_alias(name=org_alias)

{'mondata-mic-1587009600000': {'aliases': {'myOrgAlias': {}}},
 'mondata-mic-1586145600000': {'aliases': {'myOrgAlias': {}}},
 'mondata-mic-1587787200000': {'aliases': {'myOrgAlias': {'is_write_index': True}}}}

Now we can try indexing a document for our organisation through the alias, without having to specify the latest index.

The document is automatically indexed into the index that has `is_write_index` set to `True`.

Only one index per alias can have this field set to `True`.  
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
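
Because only one index may be the write index, it can help to build the alias actions in one place, so that exactly one index gets the flag each time a new index is added. A minimal sketch, with `write_alias_actions` as a hypothetical helper that only builds the request body and does not contact the cluster:

```python
def write_alias_actions(index_names, alias, write_index):
    """One 'add' action per index; only `write_index` is flagged as the write index."""
    if write_index not in index_names:
        raise ValueError(f'{write_index} is not one of the aliased indices')
    actions = []
    for name in index_names:
        actions.append({'add': {'index': name,
                                'alias': alias,
                                # Explicit False on the others avoids two write indices
                                'is_write_index': name == write_index}})
    return {'actions': actions}


body = write_alias_actions(
    ['mondata-mic-1586145600000', 'mondata-mic-1587009600000', 'mondata-mic-1587787200000'],
    'myOrgAlias',
    'mondata-mic-1587787200000',
)
print(body)
```

The returned dictionary has the same structure as the one passed to `ES.indices.update_aliases` earlier, so it could be sent the same way.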


In [18]:
timestamp = int(date3 + random.random()*(date3 - date2))
print('state timestamp : ', timestamp)
new_state = State(timestamp=timestamp)
ES.index(index=org_alias, body=new_state.json())

data = ES.search(index=index_name3, body={"query": {"match_all": {}}, "size": 10000})
# The context manager closes the file even if the write fails
with open("query_all_result.txt", "w") as f:
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ':')))



state timestamp :  1588444932708


## Using aliases to point to the last index in which an endpoint has posted
We can use an alias to point directly at that index instead of querying across several indices.
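
One way this could look is an alias per endpoint that is moved whenever the endpoint posts into a newer index. The naming scheme and both helpers below are hypothetical; the sketch only builds the body for the aliases API, using its `remove` and `add` actions so the move happens in a single request.

```python
def endpoint_alias(org_id, endpoint_id):
    """Hypothetical naming scheme; the prefix keeps it out of the '<org_id>-*' pattern."""
    return f'latest-{org_id}-{endpoint_id}'


def move_endpoint_alias(alias, new_index, old_index=None):
    """Actions that point `alias` at `new_index`, detaching it from `old_index` if given."""
    actions = []
    if old_index is not None and old_index != new_index:
        actions.append({'remove': {'index': old_index, 'alias': alias}})
    actions.append({'add': {'index': new_index, 'alias': alias}})
    return {'actions': actions}


alias = endpoint_alias('mondata-mic', 'endpoint-a')
print(move_endpoint_alias(alias,
                          new_index='mondata-mic-1587787200000',
                          old_index='mondata-mic-1587009600000'))
```

Searching with `index=alias` would then reach only the index that holds the endpoint's most recent state.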
