# Elasticsearch & Warden
This notebook illustrates how we organize our data in Elasticsearch.



Let's connect to our Elasticsearch instance and test the connection.  
Run this cell before using the notebook.


In [1]:
from elasticsearch import Elasticsearch

ES = Elasticsearch()

print('Testing connection...')
if ES.ping():
    print('Success!')
else:
    print('No connection...')

Testing connection...
Success!


## Creating an index
Let's start by defining a mapping.  
Mappings: https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping.html#create-mapping  
Data types: https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-types.html  
  
To modify the mapping later, use the put_mapping cell below; there is no need to create a new index.

In [2]:
mappings = {
    'dynamic': 'strict',  # Makes the index reject data that does not respect the mapping
    'properties': {
        'timestamp': {'type': 'date'},
        'endpoint_id': {'type': 'keyword'},
        'build_number': {'type': 'keyword'},

        # This level represents a list of pii files
        'pii_files': {
            'type': 'nested',
            'properties': {
                'path': {'type': 'text'},
                'score': {'type': 'float'},
                'mime_type': {'type': 'keyword'},
                'hash': {'type': 'keyword'},
                'encrypted': {'type': 'boolean'},
                'timestamp': {'type': 'date'},

                # This level represents the content of a pii file
                'content': {
                    'type': 'nested',
                    'properties': {
                        # Type name and id represent the name and id of the corresponding RegExs
                        'type_name': {'type': 'text'},
                        'type_id': {'type': 'keyword'},
                        'amount': {'type': 'integer'},

                        # This level represents the correlation level.
                        # This structure allows for correlation on multiple entities, not just names.
                        'correlations': {
                            'type': 'nested',
                            'properties': {
                                'type_name': {'type': 'text'},
                                'type_id': {'type': 'keyword'},
                                'correlation': {'type': 'float'},
                            },
                        },
                    },
                },
            },
        },
    },
}
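
To get a feel for what 'dynamic': 'strict' rejects, here is a minimal local sketch that lists the fields of a document that the mapping does not declare. It only approximates the server-side check (it looks at field names, not at types), and the helper name `unmapped_fields`, the trimmed `demo_mapping` and the sample documents are made up for illustration.

```python
def unmapped_fields(doc, mapping, prefix=''):
    """Return dotted paths of fields in `doc` that `mapping` does not declare."""
    properties = mapping.get('properties', {})
    unknown = []
    for key, value in doc.items():
        path = f'{prefix}{key}'
        if key not in properties:
            unknown.append(path)
            continue
        # Fields with sub-properties hold a list of objects (or a single object).
        if 'properties' in properties[key]:
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict):
                    unknown.extend(unmapped_fields(child, properties[key], f'{path}.'))
    return unknown


# Trimmed-down copy of the mapping above, just enough for the demonstration.
demo_mapping = {
    'dynamic': 'strict',
    'properties': {
        'endpoint_id': {'type': 'keyword'},
        'pii_files': {
            'type': 'nested',
            'properties': {
                'path': {'type': 'text'},
                'score': {'type': 'float'},
            },
        },
    },
}

# Made-up documents: the second one carries an undeclared 'owner' field.
good_doc = {'endpoint_id': 'laptop-01', 'pii_files': [{'path': '/tmp/a.csv', 'score': 0.7}]}
bad_doc = {'endpoint_id': 'laptop-01', 'pii_files': [{'path': '/tmp/a.csv', 'owner': 'bob'}]}

print(unmapped_fields(good_doc, demo_mapping))  # []
print(unmapped_fields(bad_doc, demo_mapping))   # ['pii_files.owner']
```

A non-empty result means the index would most likely refuse the document.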


Next, we define the settings of our index, that is, the amount of resources we allow a single index.  
For now, we allow a single shard per index. The number of shards cannot be changed once the index is created. If we ever need more capacity, we can split or reindex existing indices.  
New indices can be created with different capacities.  
Settings: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-modules-settings

In [3]:
settings = {"number_of_shards": 1}

We now have all that we need to create an index. We'll add the name of the org and the time of creation in the index name.  
Creating an index: https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-create-index.html 

In [4]:
import time

org_id = 'mondata'
created_at = int(round(time.time()*1000, 0)) # ms since epoch

index_name = f'{org_id}-{created_at}'
request_body = {'mappings': mappings, 'settings': settings}

print(f'Attempting to create index: {index_name}')

response = ES.indices.create(index=index_name, body=request_body)

print('Response: ', response)

index_id = response['index']

Attempting to create index: mondata-1586205467235
Response:  {'acknowledged': True, 'shards_acknowledged': True, 'index': 'mondata-1586205467235'}
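
Since the index name encodes the org and the creation time, it can be handy to build and parse it in one place. The following is a small sketch under that naming convention; the helper names `make_index_name` and `parse_index_name` are ours and not part of any library.

```python
import time
from datetime import datetime, timezone


def make_index_name(org_id, now=None):
    """Build '<org_id>-<ms since epoch>', like the cell above."""
    seconds = time.time() if now is None else now
    return f'{org_id}-{int(round(seconds * 1000))}'


def parse_index_name(index_name):
    """Split an index name back into (org_id, creation time in UTC)."""
    # rsplit keeps org ids that themselves contain a dash intact.
    org_id, created_ms = index_name.rsplit('-', 1)
    created_at = datetime.fromtimestamp(int(created_ms) / 1000, tz=timezone.utc)
    return org_id, created_at


org, created = parse_index_name('mondata-1586205467235')
print(org, created.isoformat())
```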


In [5]:
index_id = 'mondata-1586202184430'  # Only needed if the notebook was restarted; must be the name of an existing index

It is possible to modify the mapping of an index after creation.   
putMapping: https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-put-mapping.html

In [6]:
ES.indices.put_mapping(mappings, index=index_id)

NotFoundError: NotFoundError(404, 'index_not_found_exception', 'no such index [mondata-1586202184430]', mondata-1586202184430, index_or_alias)
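
The error above shows what happens when index_id points at an index that does not exist. A possible guard, sketched below, checks for the index first. It assumes the 7.x Python client used in this notebook, where `indices.exists` and `indices.put_mapping` accept these keyword arguments. The `hostname` field and the fake client are hypothetical and only there so the sketch runs without a cluster. Keep in mind that put_mapping is meant for adding new fields; the type of an existing field generally cannot be changed this way.

```python
def put_mapping_if_exists(client, index, mapping):
    """Update the mapping only when the index exists; return whether it was applied."""
    if not client.indices.exists(index=index):
        print(f'Index {index} does not exist, nothing to update')
        return False
    client.indices.put_mapping(body=mapping, index=index)
    return True


# Hypothetical extra field, for illustration only.
extra_field = {'properties': {'hostname': {'type': 'keyword'}}}


class _FakeIndices:
    """Stand-in for ES.indices so the sketch runs without a cluster."""

    def __init__(self, names):
        self.names = set(names)
        self.calls = []

    def exists(self, index):
        return index in self.names

    def put_mapping(self, body, index):
        self.calls.append((index, body))


class _FakeClient:
    def __init__(self, names):
        self.indices = _FakeIndices(names)


fake = _FakeClient(['mondata-1586205467235'])
print(put_mapping_if_exists(fake, 'mondata-1586205467235', extra_field))  # True
print(put_mapping_if_exists(fake, 'mondata-1586202184430', extra_field))  # False
```

With the real client, the call would be `put_mapping_if_exists(ES, index_id, extra_field)`.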

## Uploading data

Let's generate some fake data. The state class lets us generate realistic fake data to test our queries.  
This mimics three endpoints, each posting once a day for five days.


In [7]:
from state import *

fake_data = get_state_timeserie(number_of_endpoints=3, states_per_endpoint=5) # format [ [state, ...], [state, ...], ... ]


Next, we put the data in our index. If the data structure corresponds to the mapping, the request will work. Otherwise, it will be rejected.  
This is a result of setting the 'dynamic': 'strict' parameter in our mapping.  
Putting documents: https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-index_.html

In [None]:
for timeseries in fake_data:
    for state in timeseries:
        ES.index(index=index_id, body=state.json())
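
Indexing one document per request is fine for a handful of states. For larger batches, the client ships a bulk helper, `elasticsearch.helpers.bulk`. Below is a minimal sketch of how the same upload could be expressed with it. The function names are ours, and plain dicts stand in for the state objects (we have not checked what `state.json()` returns, so adapt the `_source` value accordingly).

```python
def bulk_actions(index, timeseries_list):
    """Yield one bulk 'index' action per state, in the format helpers.bulk expects."""
    for timeseries in timeseries_list:
        for state in timeseries:
            yield {'_index': index, '_source': state}


def bulk_upload(client, index, timeseries_list):
    """Send all states in bulk; returns what helpers.bulk returns."""
    # Imported lazily so the generator above can be tried without the client installed.
    from elasticsearch import helpers
    return helpers.bulk(client, bulk_actions(index, timeseries_list))


# Plain dicts stand in for the notebook's state objects here.
demo = [[{'endpoint_id': 'a'}, {'endpoint_id': 'a'}], [{'endpoint_id': 'b'}]]
actions = list(bulk_actions('mondata-demo', demo))
print(len(actions))  # 3
```

Against a live cluster, this would be called as `bulk_upload(ES, index_id, ...)` with the list of timeseries.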


Our data should be in our index. We can query for the content of the whole index.


In [None]:
import json

data = ES.search(index=index_id, body={"query": {"match_all": {}}, "size": 10000})
with open("query_all_result.txt", "a") as f:  # Append mode: reruns add to the file
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ':')))

We can delete a single document or all documents with the following lines:

In [None]:
# ES.delete(index=index_id, id='9qeWSHEBD8xa8m9AXTWA')
# ES.delete_by_query(index=index_id, body={"query": {"match_all": {}}})

Now that we can upload data to our index however we want, we can start writing and testing queries.  


## Querying the index  
The first query we're interested in making is a simple query by endpoint_id. Since the endpoint_id field is mapped as a **keyword**, we can do a **term** query.  
Term queries match exact strings. We can also sort by timestamp and ask for a certain number of states.
  
Also, we're using the **filter** context. The filter context is preferred over the query context since it offers caching functionality and should speed up performance.   
The query context offers the relevance score, which we have no use for.  
  
Query results are large, so we will save some of them to text files.
 

In [None]:
endpoint = EndpointEnum[0]['name']
last_x_states = 1 

query = {
    'query': {
        'bool': {
            'filter': [
                { "term":  { "endpoint_id": endpoint }},
            ]
        }
    },
    'size': last_x_states,
    'sort': [{
        'timestamp': {
            'order': 'desc'
        }
    }]
}

res = ES.search(index=index_id, body=query)
with open("query_endpoint_id_result.txt", "a") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
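
The response wraps each matching state in some metadata. To work with the states themselves, we can pull the `_source` of every hit out of `res['hits']['hits']`. Here is a small sketch; the `sources` helper is ours and the demo response uses made-up values, trimmed to the keys we read.

```python
def sources(response):
    """Return the stored documents (the '_source' of each hit) from a search response."""
    return [hit['_source'] for hit in response['hits']['hits']]


# Illustrative response with made-up values, trimmed to the keys used here.
demo_response = {
    'hits': {
        'total': {'value': 1, 'relation': 'eq'},
        'hits': [
            {'_id': 'abc', '_score': None, '_source': {'endpoint_id': 'laptop-01', 'pii_files': []}},
        ],
    }
}

for state in sources(demo_response):
    print(state['endpoint_id'], len(state['pii_files']))  # laptop-01 0
```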

We can also query **nested fields**. In the next example, we will look for a particular file hash in our whole index.  
We could also look for the same file hash on a particular endpoint. 


In [None]:
query = {
    'query': {
        'nested': {
            'path': 'pii_files',
            'query': {
                'bool': {
                    'filter': [
                        {
                            'term': {'pii_files.hash': '33f75cfe-7842-11ea-9e8a-9cb6d08b03d4'} # Replace with a hash that exists in your index (you have to find one by hand)
                        }
                    ]
                }
            }
        }
    }
}

res = ES.search(index=index_id, body=query)
with open("query_hash_result.txt", "a") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
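
Looking for the same hash on a particular endpoint amounts to combining the two previous queries: a term filter on endpoint_id next to the nested filter on pii_files.hash. A possible sketch follows; the function name and the sample endpoint id are ours.

```python
def hash_on_endpoint_query(endpoint_id, file_hash):
    """Build a search body matching states of one endpoint that contain a given file hash."""
    return {
        'query': {
            'bool': {
                'filter': [
                    # Top-level keyword field: plain term filter.
                    {'term': {'endpoint_id': endpoint_id}},
                    # Nested field: the term filter has to be wrapped in a nested query.
                    {
                        'nested': {
                            'path': 'pii_files',
                            'query': {'term': {'pii_files.hash': file_hash}},
                        }
                    },
                ]
            }
        }
    }


body = hash_on_endpoint_query('laptop-01', '33f75cfe-7842-11ea-9e8a-9cb6d08b03d4')
print(len(body['query']['bool']['filter']))  # 2
```

The body can then be passed to `ES.search(index=index_id, body=body)` like the other queries.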

We can even do **double nested queries**. For example, we can look for every state with a file that has 100 or more instances of a specific type of sensitive information (type_id).

In [None]:
type_id = RegexEnum[0]['guid']
amount = 100

query = {
    'query': {
        'nested': {
            'path': 'pii_files',
            'query': {
                'nested': {
                    'path': 'pii_files.content',
                    'query': {
                        'bool': {
                            'filter': [
                                {
                                    'term': {'pii_files.content.type_id': type_id}
                                },
                                {
                                    'range': {'pii_files.content.amount': {'gte': amount}}
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
}

res = ES.search(index=index_id, body=query)
with open("query_type_id_result.txt", "a") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
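
Note that a nested query returns whole states, including the files that did not match. If we only want the matching files, one option is to filter the returned state on the client side. The sketch below does that for a single state; the helper name and the sample state (with a made-up type id) are ours.

```python
def matching_files(state, type_id, min_amount):
    """Return the pii_files of one state where a single content entry meets both conditions."""
    return [
        pii_file
        for pii_file in state.get('pii_files', [])
        if any(
            entry.get('type_id') == type_id and entry.get('amount', 0) >= min_amount
            for entry in pii_file.get('content', [])
        )
    ]


# Made-up state with two files; only the first has enough instances of 'regex-1'.
demo_state = {
    'endpoint_id': 'laptop-01',
    'pii_files': [
        {'path': '/tmp/a.csv', 'content': [{'type_id': 'regex-1', 'amount': 150}]},
        {'path': '/tmp/b.csv', 'content': [{'type_id': 'regex-1', 'amount': 3}]},
    ],
}

print([f['path'] for f in matching_files(demo_state, 'regex-1', 100)])  # ['/tmp/a.csv']
```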

Now that we've established that we can easily make the queries we want, let's attempt a few aggregations.  


## Aggregations  
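Let's try to compute the maximum score for an endpoint. We will run an aggregation on the pii_files field, and ask for the maximum score.  
This is an example of a **nested aggregation**.  
Note that aggregations are computed over every document matching the query, so 'size' and 'sort' only affect the returned hits. To restrict the maximum to the endpoint's latest state, the query itself has to select that state, for example with an additional filter on timestamp.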

In [None]:
endpoint = EndpointEnum[0]['name']


query = {
    'query': {
        'bool': {
            'filter': [
                { "term":  { "endpoint_id": endpoint }},
            ]
        }
    },
    'aggregations': {
        "piifiles" : {
            'nested': {
                'path': 'pii_files'
            },
            'aggregations': {
                'max_score': {'max': {'field': "pii_files.score"}}
            }
        }
    },
    'size': 1,
    'sort': [{
        'timestamp': {
            'order': 'desc'
        }
    }]
}


res = ES.search(index=index_id, body=query)
with open("aggs_max_score_result.txt", "a") as f:
    f.write(json.dumps(res, sort_keys=True, indent=4, separators=(',', ':')))
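
The aggregation result sits under the 'aggregations' key of the response, keyed by the names we chose in the request ('piifiles' and 'max_score'). A small sketch for reading it out is shown below. The helper name is ours and the demo response uses made-up numbers, trimmed to the keys we read. As far as we know, the value is None when no nested document was aggregated.

```python
def max_score_from_response(response, agg_name='piifiles'):
    """Return (max score, number of nested pii_files aggregated) from a search response."""
    nested = response['aggregations'][agg_name]
    return nested['max_score']['value'], nested['doc_count']


# Illustrative response with made-up numbers, trimmed to the keys used here.
demo_response = {
    'aggregations': {
        'piifiles': {
            'doc_count': 12,
            'max_score': {'value': 0.75},
        }
    }
}

score, file_count = max_score_from_response(demo_response)
print(score, file_count)  # 0.75 12
```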
