# How to interact with the OCD Datalake API

#### We'll first look at how to authenticate, then at how to retrieve threats: first by looking up a specific value, then with a broader search. Finally, we'll see how to retrieve millions of threats at once, at the cost of some delay in the response.

In [1]:
# Required imports
import json
from time import sleep

import requests
from urllib.parse import urljoin

In [2]:
# Keep in mind that two URLs exist; the data on each one (including credentials) is fully separate

production = 'https://datalake.cert.orangecyberdefense.com/'
preproduction = 'https://ti.extranet.mrti-center.com/'  # Used for tests only, data should not be considered accurate

api_url = preproduction  # We'll use the preproduction for this tutorial
doc_url = urljoin(api_url, '/api/v2/docs/')
print(f'Full documentation is available at {doc_url}')

raw_swagger = urljoin(doc_url, 'swagger.json')
json_swagger = requests.get(raw_swagger).json()
print(f'Current version: {json_swagger["info"]["version"]}')

Full documentation is available at https://ti.extranet.mrti-center.com/api/v2/docs/
Current version: 2.3.3


The first step of any request is to retrieve an access token, which is valid for 10-15 minutes.
After that, you will need to generate a new one, either with your credentials or with the refresh token.

In [None]:
user_email = ''  # Fill me
user_password = ''
auth_url = urljoin(api_url, '/api/v2/auth/token/')
auth_response = requests.post(
    auth_url,
    json={
        "email": user_email,
        "password": user_password
    },
).json()
token = auth_response['access_token']
authentication_header = {'Authorization': f'Token {token}'}

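Because the token expires after 10-15 minutes, a long-running script needs a way to renew it. The sketch below is one possible approach, not part of the Datalake API: `TokenCache`, `fetch_token` and the 600-second default are hypothetical names and values chosen for illustration. It simply calls the supplied function again once the cached token is older than a conservative age.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Hypothetical helper: caches a token and fetches a new one once it is too old."""

    def __init__(
        self,
        fetch_token: Callable[[], str],
        max_age_seconds: float = 600,  # Assumption: stay below the 10-15 min validity
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch_token = fetch_token
        self._max_age = max_age_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def header(self) -> dict:
        """Return the Authorization header, fetching a new token first if needed."""
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._max_age:
            self._token = self._fetch_token()
            self._fetched_at = now
        return {'Authorization': f'Token {self._token}'}


# Possible usage with the variables defined in this tutorial:
# cache = TokenCache(lambda: requests.post(
#     auth_url, json={"email": user_email, "password": user_password},
# ).json()['access_token'])
# requests.get(lookup_url, headers=cache.header())
```
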
With this token we can retrieve our first threat, for example by checking whether a specific domain exists in Datalake:

In [None]:
lookup_url = urljoin(api_url, '/api/v2/mrti/threats/lookup/')
domain = 'google.fr'
lookup_response = requests.get(
    lookup_url,
    params={
        'atom_value': domain,
        'atom_type': 'domain',
        'hashkey_only': False,  # We want all data, not just if the threat is present or not
    },
    headers=authentication_header,
).json()

# Remove some data for readability (the default avoids a KeyError if a key is absent)
lookup_response.pop('tags', None)
lookup_response.pop('metadata', None)

print(f'Data about {domain}:')
print(json.dumps(lookup_response, indent=2, sort_keys=True))

That's a lot of data. We can focus on the scores, for example: the risk is a value between 0 and 100, with 100 meaning confirmed malicious:

In [None]:
print(json.dumps(lookup_response.get('scores'), indent=2, sort_keys=True))

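To turn the scores into something easier to act on, we could reduce them to one risk value per threat type. The sketch below assumes each entry of `scores` looks like `{"threat_type": ..., "score": {"risk": ...}}`; this layout is an assumption to check against the printed output above, and the helper names and sample values are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def risk_by_threat_type(scores: Optional[List[dict]]) -> Dict[str, int]:
    """Map each threat type to its risk, skipping entries that lack either field.

    Assumed entry layout: {"threat_type": "...", "score": {"risk": 0-100}}.
    """
    risks = {}
    for entry in scores or []:
        threat_type = entry.get('threat_type')
        risk = entry.get('score', {}).get('risk')
        if threat_type is not None and risk is not None:
            risks[threat_type] = risk
    return risks


def highest_risk(scores: Optional[List[dict]]) -> Optional[Tuple[str, int]]:
    """Return the (threat_type, risk) pair with the highest risk, or None if empty."""
    risks = risk_by_threat_type(scores)
    if not risks:
        return None
    return max(risks.items(), key=lambda item: item[1])


# Illustrative values only, not real Datalake data
sample_scores = [
    {'threat_type': 'malware', 'score': {'risk': 12}},
    {'threat_type': 'phishing', 'score': {'risk': 40}},
    {'threat_type': 'spam', 'score': {}},  # No risk value: ignored
]
print(highest_risk(sample_scores))
```
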
If you want to do a broader search, you can use the Advanced Search!

In [None]:
as_url = urljoin(api_url, '/api/v2/mrti/advanced-queries/threats/')
as_response = requests.post(
    as_url,
    json={
        "limit": 10,
        'query_body': {
            "AND": [
                {
                    "AND": [
                        {
                            "field": "atom_type",
                            "multi_values": [
                                "url"
                            ],
                            "type": "filter"
                        },
                        {
                            "field": "last_updated",
                            "type": "filter",
                            "value": "864000"  # 10 days in seconds
                        },
                        {  # We exclude whitelist (score = 0)
                            "field": "risk",
                            "range": {
                                "gt": 0
                            },
                            "type": "filter"
                        },
                        {
                            "atom_values_only": True,
                            "field": "atom_details",
                            "type": "search",
                            "value": "google fr"  # Search value
                        }
                    ]
                }
            ]
        }
    },
    headers=authentication_header,
).json()
print(json.dumps(as_response, indent=2, sort_keys=True))

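The nested query body above is verbose to write by hand. As a minimal sketch, a small helper can assemble the same structure from a few parameters. `build_query_body` and its parameter names are hypothetical; only the field names and nesting come from the request shown above.

```python
from typing import List, Optional


def build_query_body(
    atom_types: List[str],
    max_age_seconds: int,
    search_value: str,
    risk_greater_than: Optional[int] = None,
) -> dict:
    """Build a query body with the same shape as the Advanced Search example above."""
    filters = [
        {'field': 'atom_type', 'multi_values': list(atom_types), 'type': 'filter'},
        # The example passes the age as a string of seconds
        {'field': 'last_updated', 'type': 'filter', 'value': str(max_age_seconds)},
    ]
    if risk_greater_than is not None:
        # A value of 0 excludes whitelisted threats (score = 0)
        filters.append(
            {'field': 'risk', 'range': {'gt': risk_greater_than}, 'type': 'filter'}
        )
    filters.append({
        'atom_values_only': True,
        'field': 'atom_details',
        'type': 'search',
        'value': search_value,
    })
    return {'AND': [{'AND': filters}]}


# Same query as above: URLs updated in the last 10 days, risk > 0, matching "google fr"
query_body = build_query_body(['url'], 10 * 24 * 3600, 'google fr', risk_greater_than=0)
```
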
An easy way to build a query body is to first build the query in the GUI. The query body can then be shown by clicking on the link next to "Results for:":
![Link next to "Results for:"](docs/gui_as_body.png)

The id of the query (called a query hash) can also be used.

To retrieve more than 5k results at a time, the bulk search must be used.
First, let's check if we have the proper permission:

In [None]:

me_url = urljoin(api_url, 'api/v2/users/me/')
me_response = requests.get(me_url, headers=authentication_header).json()
# Uncomment to see the full response
#print(json.dumps(me_response, indent=2, sort_keys=True))

permissions = me_response.get("role", {}).get("administration_permissions", [])
bs_access = any(p.get('name', '') == 'bulk_search' for p in permissions)
print(f'Has access to bulk search: {bs_access}')

Assuming we have the permission, let's first queue our request:

In [None]:
bs_url = urljoin(api_url, 'api/v2/mrti/bulk-search/')
bs_response = requests.post(
    bs_url,
    json={
        'query_body': {
            "AND": [
                {
                    "AND": [
                        {
                            "field": "atom_type",
                            "multi_values": [
                                "url"
                            ],
                            "type": "filter"
                        },
                        {
                            "field": "last_updated",
                            "type": "filter",
                            "value": "600"  # last 10mins of data
                        },
                        {
                            "atom_values_only": True,
                            "field": "atom_details",
                            "type": "search",
                            "value": "google"
                        }
                    ]
                }
            ]
        }
    },
    headers=authentication_header
).json()
print(json.dumps(bs_response, indent=2, sort_keys=True))
task_uuid = bs_response['task_uuid']  # The response contains the task uuid, which is required to retrieve the results

# The request can take more or less time, depending on our position in the queue
# For this tutorial we simply wait a couple of minutes; in a real product, you should retry at a defined interval instead
sleep(120)
print('Done waiting')

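Rather than a fixed `sleep(120)`, a real product would check at a defined interval until the task is done or a timeout is reached. The sketch below shows one way to do that. `poll_until_ready` is a hypothetical helper, and how the API signals an unfinished task is not covered in this tutorial, so that decision is left to the `fetch` callable, which returns `None` while the result is not ready.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar('T')


def poll_until_ready(
    fetch: Callable[[], Optional[T]],
    interval_seconds: float = 10,
    timeout_seconds: float = 600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `fetch` at a fixed interval until it returns something other than None.

    Raises TimeoutError if the next attempt would start after the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() + interval_seconds > deadline:
            raise TimeoutError(f'No result after {timeout_seconds} seconds')
        sleep(interval_seconds)


# Possible usage (the readiness check is an assumption to adapt to the API's behaviour):
# def fetch_bulk_result():
#     response = requests.get(bs_result_url, headers=authentication_header)
#     return response.json() if task_is_done(response) else None  # task_is_done: hypothetical
# bs_result_response = poll_until_ready(fetch_bulk_result, interval_seconds=15)
```
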
Assuming the request was processed, we can now retrieve the results (several millions at once) with the `task_uuid`:

In [None]:
bs_result_url = urljoin(api_url, f'api/v2/mrti/bulk-search/task/{task_uuid}/')
bs_result_response = requests.get(bs_result_url, headers=authentication_header).json()

print(f'Results downloaded: {bs_result_response.get("count")}')
print(json.dumps(bs_result_response, indent=2, sort_keys=True))

This request concludes the tutorial. Feel free to explore the API documentation.
The GUI and [the CLI](https://github.com/cert-orangecyberdefense/datalake/) are also good examples of how to interact with the API.
Don't hesitate to contact the Datalake team if you have any questions or feedback.