# Tutorial: Learn the basics of the Druid API

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
  
This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently, including the following tasks:

- Checking if your cluster is up
- Ingesting data
- Querying data
- Deleting data

In a Druid deployment, you have [Master, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.

## Table of contents

- [Before you start](#Before-you-start)
- [Get basic cluster information](#Get-basic-cluster-information)
- [Ingest data](#Ingest-data)
- [Query your data](#Query-your-data)
- [Manage your data](#Manage-your-data)
- [Next steps](#Next-steps)

For the best experience, use Jupyter Lab so that you can always access the table of contents.

## Before you start

You'll need to install the Requests library for Python before you start. For example:

```bash
pip3 install requests
```

Next, you'll need a Druid cluster. This tutorial uses the `micro-quickstart` config described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So download that and start it if you haven't already. In the root of the Druid folder, run the following command to start Druid:

```bash
./bin/start-micro-quickstart
```

Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:

```bash
# If you're using Jupyter lab
jupyter lab --port 3001
# If you're using Jupyter notebook
jupyter notebook --port 3001 
```

To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`.

In [5]:
import requests
import json

# druid_host is the hostname and port for your Druid deployment. 
druid_host = "http://localhost:8888"
dataSourceName = "wikipedia-api"
print(druid_host)

http://localhost:8888


In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task.
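
The repeated pattern of setting `endpoint`, `http_method`, and `payload` can be sketched as one small helper. This is only an illustration and not part of the original notebook: `build_request` is a hypothetical name, and it assembles the keyword arguments that `requests.request` accepts without making a network call itself.

```python
import json

# Hypothetical helper (not part of the original notebook): collects the
# arguments that the cells in this tutorial pass to requests.request.
def build_request(host, endpoint, http_method="GET", payload=None):
    headers = {}
    data = None
    if payload is not None:
        # Serialize the payload as JSON and label it accordingly.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload)
    return {
        "method": http_method,
        "url": host + endpoint,
        "headers": headers,
        "data": data,
    }

status_request = build_request("http://localhost:8888", "/status")
print(status_request["method"], status_request["url"])
```

With a running cluster, you could pass the result straight through, for example `requests.request(**status_request)`.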

## Get basic cluster information

In this cell, you'll use the `GET /status` API to return basic information about your cluster, such as the Druid version, loaded extensions, and resource consumption.

The following cell sets `endpoint` to `/status` and updates the HTTP method to `GET`. When you run the cell, you should get a response that starts with the version number of your Druid deployment.

In [None]:
endpoint = "/status"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

### Get cluster health

The `/status/health` endpoint returns `true` if your cluster is up and running. It's useful if you want to do something like programmatically check if your cluster is available. When you run the following cell, you should get `true` if your Druid cluster has finished starting up and is running.

In [5]:
# GET 
endpoint = "/status/health"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(response.text)

http://localhost:8888/status/health
true
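
If you want to wait for the cluster programmatically, one possible approach is to poll the health endpoint until it answers `true`. The following is a minimal sketch under assumptions: `wait_until_healthy`, `attempts`, and `delay_seconds` are hypothetical names, and `check` stands in for a call such as `requests.get(druid_host + "/status/health").text`. A fake check is used here so the example runs without a cluster.

```python
import time

# Hypothetical helper: `check` is any zero-argument callable that returns the
# body of GET /status/health as text.
def wait_until_healthy(check, attempts=10, delay_seconds=1.0):
    for attempt in range(attempts):
        try:
            if check().strip() == "true":
                return True
        except Exception:
            # Treat connection failures as "not up yet" and retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False

# Simulated cluster that refuses the first two calls, then reports healthy.
calls = {"count": 0}

def fake_check():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("cluster still starting")
    return "true"

print(wait_until_healthy(fake_check, attempts=5, delay_seconds=0))
```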


## Ingest data

Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).

This tutorial uses the multi-stage query (MSQ) task engine and its `/druid/v2/sql/task` endpoint to perform SQL-based ingestion. The endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields.

To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).


The next cell does the following:

- Includes a payload that inserts data from an external source into a table named `wikipedia-api`. The payload is in JSON format and included in the code directly. You can also store it in a file and provide the file.
- Saves the response to a unique variable that you can reference later to identify this ingestion task.

The example uses INSERT, but you could also use REPLACE. 

For the MSQ task engine, ingesting data is done through a task, so the response includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it.

In [73]:
endpoint = "/druid/v2/sql/task"
print(druid_host+endpoint)
http_method = "POST"


payload = json.dumps({
  # The datasource name contains a hyphen, so it is double-quoted in the SQL.
  "query": "INSERT INTO \"wikipedia-api\"\nSELECT\n  TIME_PARSE(\"timestamp\") AS __time,\n  *\nFROM TABLE(\n  EXTERN(\n    '{\"type\": \"http\", \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n    '{\"type\": \"json\"}',\n    '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\", \"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"}, {\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\", \"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"}, {\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\", \"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\": \"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\": \"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\": \"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\": \"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\": \"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\": \"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\": \"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"regionIsoCode\", \"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"}, {\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]'\n  )\n)\nPARTITIONED BY DAY",
  "context": {
    "maxNumTasks": 3
  }
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
ingestion_taskId_response = response
print(response.text + f"\nInserting data into the table named {dataSourceName}.")

http://localhost:8888/druid/v2/sql/task
{"taskId":"query-73f5ab42-5e85-4dd2-a03f-5b939846d624","state":"RUNNING"}
Inserting data into the table named wikipedia.


Extract the `taskId` value from the `ingestion_taskId_response` variable so that you can reference it later:

In [10]:
ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']
print(ingestion_taskId)

query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3
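
Hand-escaping a long SQL string inside JSON is easy to get wrong. As an alternative, the ingestion statement can be assembled from Python data structures. The following is a minimal sketch, not the notebook's own method: `build_insert_query` is a hypothetical helper, the URI is a placeholder, only three columns are listed for brevity, and the sketch assumes none of the values contain a single quote. The datasource name is wrapped in double quotes because it contains a hyphen.

```python
import json

# Hypothetical helper: builds an INSERT ... EXTERN statement from Python data
# instead of a hand-escaped string.
def build_insert_query(datasource, uri, columns, granularity="DAY"):
    input_source = json.dumps({"type": "http", "uris": [uri]})
    input_format = json.dumps({"type": "json"})
    signature = json.dumps([{"name": n, "type": t} for n, t in columns])
    return (
        f'INSERT INTO "{datasource}"\n'
        "SELECT\n"
        '  TIME_PARSE("timestamp") AS __time,\n'
        "  *\n"
        "FROM TABLE(\n"
        "  EXTERN(\n"
        f"    '{input_source}',\n"
        f"    '{input_format}',\n"
        f"    '{signature}'\n"
        "  )\n"
        ")\n"
        f"PARTITIONED BY {granularity}"
    )

query = build_insert_query(
    "wikipedia-api",
    "https://example.com/wikipedia.json.gz",  # placeholder URI, an assumption
    [("added", "long"), ("channel", "string"), ("timestamp", "string")],
)
# json.dumps takes care of escaping the quotes and newlines in the query.
payload = json.dumps({"query": query, "context": {"maxNumTasks": 3}})
print(query)
```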


### Get the status of your task

The following cell shows you how to get the status of your ingestion task. The response includes basic information about the task, such as when it was created and whether it has finished.

In addition to the status, you can retrieve a full report about the task using `GET /druid/indexer/v1/task/TASK_ID/reports`. You won't need that information for this tutorial.

In [96]:
endpoint = f"/druid/indexer/v1/task/{ingestion_taskId}/status"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

http://localhost:8888/druid/indexer/v1/task/query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3/status
{
    "task": "query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3",
    "status": {
        "id": "query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3",
        "groupId": "query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3",
        "type": "query_controller",
        "createdTime": "2022-10-25T16:44:19.282Z",
        "queueInsertionTime": "1970-01-01T00:00:00.000Z",
        "statusCode": "SUCCESS",
        "status": "SUCCESS",
        "runnerStatusCode": "WAITING",
        "duration": 96474,
        "location": {
            "host": "localhost",
            "port": 8100,
            "tlsPort": -1
        },
        "dataSource": "wikipedia",
        "errorMsg": null
    }
}
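
To act on the status response in code rather than read it by eye, you can pull out the few fields that matter. The following is a minimal sketch: `summarize_task` is a hypothetical helper, and it assumes that `SUCCESS` and `FAILED` are the only terminal values of `statusCode`. The sample input is a trimmed copy of the response shown above.

```python
# Hypothetical helper: reads the fields shown in the task status response.
def summarize_task(status_response):
    status = status_response["status"]
    code = status["statusCode"]
    return {
        "task": status_response["task"],
        "state": code,
        # Assumption: any other state means the task is still in progress.
        "finished": code in ("SUCCESS", "FAILED"),
        "error": status.get("errorMsg"),
    }

sample = {
    "task": "query-14f2ffc5-9cba-43e5-a3ea-9bcc18f4afe3",
    "status": {
        "statusCode": "SUCCESS",
        "duration": 96474,
        "errorMsg": None,
    },
}
print(summarize_task(sample))
```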


## Query your data

When you ingest data into Druid, Druid stores the data in a datasource, and this datasource is what you run queries against.

### List your datasources

You can get a list of datasources from the `/druid/coordinator/v1/datasources` endpoint. Since you're just getting started, there should only be a single datasource, the `wikipedia-api` table you created earlier when you ingested external data.

In [15]:
endpoint = "/druid/coordinator/v1/datasources"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(response.text)

http://localhost:8888/druid/coordinator/v1/datasources
["wikipedia"]


### Query your data

Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes.


In [217]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  # Double quotes are required around the hyphenated datasource name.
  "query": "SELECT * FROM \"wikipedia-api\" LIMIT 3"
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)

print(json.dumps(json.loads(response.text), indent=4))



http://localhost:8888/druid/v2/sql
[
    {
        "__time": "2016-06-27T00:00:11.080Z",
        "added": 31,
        "channel": "#sv.wikipedia",
        "cityName": "",
        "comment": "Botskapande Indonesien omdirigering",
        "commentLength": 35,
        "countryIsoCode": "",
        "countryName": "",
        "deleted": 0,
        "delta": 31,
        "deltaBucket": "0.0",
        "diffUrl": "https://sv.wikipedia.org/w/index.php?oldid=36099284&rcid=89369918",
        "flags": "NB",
        "isAnonymous": "false",
        "isMinor": "false",
        "isNew": "true",
        "isRobot": "true",
        "isUnpatrolled": "false",
        "metroCode": "",
        "namespace": "Main",
        "page": "Salo Toraut",
        "regionIsoCode": "",
        "regionName": "",
        "timestamp": "2016-06-27T00:00:11.080Z",
        "user": "Lsjbot",
        "diffdruid_host": ""
    },
    {
        "__time": "2016-06-27T00:00:17.457Z",
        "added": 125,
        "channel": "#ja.wikiped

In addition to the query, there are a few additional things you can define within the payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).

This tutorial uses a context parameter and a dynamic parameter.

Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For more information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can reference the query more easily, such as when you need to cancel it.


Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).


The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`.

In [105]:
endpoint = "/druid/v2/sql"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  # Double quotes are required around the hyphenated datasource name.
  "query": "SELECT * FROM \"wikipedia-api\" WHERE __time > ? LIMIT 1",
  "context": {
    "sqlQueryId": "important-query"
  },
  "parameters": [
    { "type": "TIMESTAMP", "value": "2016-06-27"}
  ]
})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

http://localhost:8888/druid/v2/sql
[
    {
        "__time": "2016-06-27T00:00:11.080Z",
        "added": 31,
        "channel": "#sv.wikipedia",
        "cityName": "",
        "comment": "Botskapande Indonesien omdirigering",
        "commentLength": 35,
        "countryIsoCode": "",
        "countryName": "",
        "deleted": 0,
        "delta": 31,
        "deltaBucket": "0.0",
        "diffUrl": "https://sv.wikipedia.org/w/index.php?oldid=36099284&rcid=89369918",
        "flags": "NB",
        "isAnonymous": "false",
        "isMinor": "false",
        "isNew": "true",
        "isRobot": "true",
        "isUnpatrolled": "false",
        "metroCode": "",
        "namespace": "Main",
        "page": "Salo Toraut",
        "regionIsoCode": "",
        "regionName": "",
        "timestamp": "2016-06-27T00:00:11.080Z",
        "user": "Lsjbot",
        "diffdruid_host": ""
    }
]
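
The three payload fields used above (`query`, `context`, and `parameters`) can be assembled by a small function so that each query cell stays short. This is a sketch under assumptions: `build_sql_payload` is a hypothetical helper, and its placeholder check is deliberately naive because it counts every `?`, including any inside string literals.

```python
import json

# Hypothetical helper: assembles the JSON body for POST /druid/v2/sql.
def build_sql_payload(query, parameters=(), query_id=None):
    # Naive check: counts every `?`, even ones inside string literals.
    if query.count("?") != len(parameters):
        raise ValueError("Number of placeholders and parameters differ")
    body = {"query": query}
    if query_id is not None:
        body["context"] = {"sqlQueryId": query_id}
    if parameters:
        # Each entry binds, in order, to one `?` placeholder in the query.
        body["parameters"] = [{"type": t, "value": v} for t, v in parameters]
    return json.dumps(body)

payload = build_sql_payload(
    'SELECT * FROM "wikipedia-api" WHERE __time > ? LIMIT 1',
    parameters=[("TIMESTAMP", "2016-06-27")],
    query_id="important-query",
)
print(payload)
```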


## Manage your data

Besides running queries against data you ingest, you can also retrieve information related to your datasource or manage its lifecycle. There's a lot you can do, so this tutorial explores some basics around segments. Druid stores its data and indexes in segments partitioned by time.

### Segment metadata

To retrieve a list of segments for your datasource, use the `druid/coordinator/v1/metadata/segments` endpoint with `?datasources=YOUR_DATASOURCE` appended to the end:

```HTTP
druid/coordinator/v1/metadata/segments?datasources=YOUR_DATASOURCE
```


You can include multiple datasources by appending `&datasources=YOUR_DATASOURCE` for each additional datasource. Alternatively, you can omit the parameter and get all segments for all datasources. You can modify the information you get by changing the path. For more information about working with segments using the API, see [Metadata store information](https://druid.apache.org/docs/latest/operations/api-reference.html#metadata-store-information).
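
Rather than concatenating the query string by hand, you can let the standard library repeat the `datasources` key for you. The following is a minimal sketch: `segments_endpoint` is a hypothetical helper, and `another-datasource` is a made-up name used only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical helper: builds the segment metadata path for zero or more
# datasources.
def segments_endpoint(datasources=()):
    path = "/druid/coordinator/v1/metadata/segments"
    if not datasources:
        # No filter: the endpoint covers all datasources.
        return path
    # doseq=True repeats the `datasources` key once per value.
    return path + "?" + urlencode({"datasources": list(datasources)}, doseq=True)

print(segments_endpoint(["wikipedia-api", "another-datasource"]))
```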

For a higher level view, you can also retrieve the total number of segments for a datasource like in the following cell. Learning more about segments lets you do things such as marking certain segments unused if you're not interested in referencing that data and want to delete it (either manually or automatically by setting rules).


In [155]:
endpoint = f"/druid/coordinator/v1/datasources/{dataSourceName}/intervals?simple"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

http://localhost:8888/druid/coordinator/v1/datasources/wikipedia/intervals?simple
{
    "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z": {
        "size": 70040079,
        "count": 11
    }
}
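
The `?simple` response is keyed by interval, so a datasource that spans several days returns several entries. To get totals across all of them, you can sum the per-interval values. This is a small sketch with a hypothetical helper name, `segment_totals`, run against the sample response above.

```python
# Hypothetical helper: adds up the per-interval values of the ?simple response.
def segment_totals(intervals_response):
    entries = intervals_response.values()
    return {
        "count": sum(entry["count"] for entry in entries),
        "size": sum(entry["size"] for entry in entries),
    }

sample = {
    "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z": {
        "size": 70040079,
        "count": 11,
    }
}
print(segment_totals(sample))
```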


Now that you know that there are 11 total segments, you can see a full list of them:

In [165]:
endpoint = "/druid/coordinator/v1/metadata/segments"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

http://localhost:8888/druid/coordinator/v1/metadata/segments
[
    {
        "dataSource": "wikipedia",
        "interval": "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z",
        "version": "2022-10-24T19:45:23.151Z",
        "loadSpec": {
            "type": "local",
            "path": "/Users/brian/apache-druid-24.0.0/var/druid/segments/wikipedia/2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z/2022-10-24T19:45:23.151Z/0/index.zip"
        },
        "dimensions": "added,channel,cityName,comment,commentLength,countryIsoCode,countryName,deleted,delta,deltaBucket,diffUrl,flags,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,timestamp,user",
        "metrics": "",
        "shardSpec": {
            "type": "numbered",
            "partitionNum": 0,
            "partitions": 0
        },
        "binaryVersion": 9,
        "size": 8017416,
        "identifier": "wikipedia_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-2

### Mark a segment unused

> For a complete tutorial on marking data as unused and deleting it, see [Tutorial](https://druid.apache.org/docs/latest/tutorials/tutorial-delete-data.html).

The first step to deleting data is to mark it unused. To mark specific segments unused, you need either the segment IDs from the `identifier` field or the time interval from the `interval` field. The following example marks one segment as unused by specifying the segment ID. The response reports how many segments changed in the `numChangedSegments` field, so a value of 1 confirms that the segment was marked unused.

Make sure you update the following cell to specify a segment you want to delete by providing the following:

- the segment ID for `segment_to_delete` 
- the interval for `interval_to_delete`

In [216]:
endpoint = f"/druid/coordinator/v1/datasources/{dataSourceName}/markUnused"
print(druid_host+endpoint)
http_method = "POST"
# Set this to the segmentID that you want to delete
segment_to_delete = "wikipedia-api_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-24T19:45:23.151Z_10"
# Set this to the interval represented by the segmentID
interval_to_delete = "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"

# Provide the segment IDs as a list
payload = json.dumps({"segmentIds": [segment_to_delete]})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(response.text)

http://localhost:8888/druid/coordinator/v1/datasources/wikipedia/markUnused
{"numChangedSegments":0}
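
The `markUnused` endpoint accepts either a list of segment IDs or an interval in its payload. The following sketch shows both forms and how to read the count back from the response. The helper names `mark_unused_payload` and `changed_count` are hypothetical. A count of 0, as in the output above, suggests that no used segment matched the request.

```python
import json

# Hypothetical helper: builds the markUnused body from exactly one selector.
def mark_unused_payload(segment_ids=None, interval=None):
    if (segment_ids is None) == (interval is None):
        raise ValueError("Provide either segment_ids or interval, but not both")
    if interval is not None:
        return json.dumps({"interval": interval})
    return json.dumps({"segmentIds": list(segment_ids)})

# Hypothetical helper: reads numChangedSegments from the response body.
def changed_count(response_text):
    return json.loads(response_text).get("numChangedSegments", 0)

by_interval = mark_unused_payload(
    interval="2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"
)
print(by_interval)
print(changed_count('{"numChangedSegments":0}'))
```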


### Mark a segment as used

The following example marks a segment as used and makes its data available again. It's identical to marking a segment as unused except for the last part of the endpoint.

In [211]:
endpoint = f"/druid/coordinator/v1/datasources/{dataSourceName}/markUsed"
print(druid_host+endpoint)
http_method = "POST"

# Provide the segment IDs as a list
payload = json.dumps({"segmentIds": [segment_to_delete]})
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(response.text)

http://localhost:8888/druid/coordinator/v1/datasources/wikipedia/markUsed
{"message":"Cannot find segment ids [wikipedia_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-24T19:45:23.151Z_10]"}


### Verify segment is unused

You can confirm that the segment isn't available anymore by rerunning the cell referenced in [Segment metadata](#Segment-metadata). You'll notice that the segment count has decreased by one. Alternatively, you can retrieve the information for that specific segment.


The following cell returns a 404 if the segment is unused or unknown, so you should only see the print statement with the endpoint. Try it after marking a segment as unused by rerunning [Mark a segment unused](#Mark-a-segment-unused).


In [210]:
endpoint = f"/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments/wikipedia-api_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-24T19:45:23.151Z_10"
print(druid_host+endpoint)
http_method = "GET"

payload = {}
headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(response.text)

http://localhost:8888/druid/coordinator/v1/metadata/datasources/wikipedia/segments/wikipedia_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-24T19:45:23.151Z_10
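
Because an empty body is easy to misread, it can help to check the HTTP status code instead, which `requests` exposes as `response.status_code`. The following is a minimal sketch: `describe_segment_lookup` is a hypothetical helper, and it assumes that a used segment is returned with status 200.

```python
# Hypothetical helper: turns the status code of the segment lookup into text.
def describe_segment_lookup(status_code):
    if status_code == 200:
        return "segment is used"
    if status_code == 404:
        return "segment is unused or unknown"
    return f"unexpected HTTP status {status_code}"

for code in (200, 404):
    print(code, describe_segment_lookup(code))
```

In the cell above, you could call it as `describe_segment_lookup(response.status_code)`.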



### Delete data permanently

Now that a segment is marked as unused, you can delete just that specific segment from your datasource by running a kill task. If you want to delete a datasource completely, you'd use `DELETE /druid/coordinator/v1/datasources/{dataSourceName}`.

The following cell deletes the segment you marked as unused earlier and returns the taskID for the kill task.

In [213]:
endpoint = "/druid/indexer/v1/task"
print(druid_host+endpoint)
http_method = "POST"

payload = json.dumps({
  "type": "kill",
  "dataSource": "wikipedia-api",
  "interval": "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"
})

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(response.text)

http://localhost:8888/druid/indexer/v1/task
{"task":"kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z"}
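
Kill task IDs include a random component and a timestamp, so copying one by hand into a later cell is error-prone. One option is to derive the status endpoint from the response itself. This is a minimal sketch with a hypothetical helper name, `kill_status_endpoint`, applied to the sample response above.

```python
import json

# Hypothetical helper: reads the task ID from the kill response and builds
# the matching status endpoint.
def kill_status_endpoint(kill_response_text):
    task_id = json.loads(kill_response_text)["task"]
    return f"/druid/indexer/v1/task/{task_id}/status"

sample_response = (
    '{"task":"kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_'
    '2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z"}'
)
print(kill_status_endpoint(sample_response))
```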


Now, if you try to [mark the same segment as used](#Mark-a-segment-as-used), you'll get an error message about not being able to find the segment ID. The data contained in the segment has been permanently deleted.

You can also check the status of the kill task by referencing the task ID.

In [215]:
endpoint = f"/druid/indexer/v1/task/kill_wikipedia-api_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z/status"
print(druid_host+endpoint)
http_method = "GET"

payload = {}

headers = {'Content-Type': 'application/json'}

response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)
print(json.dumps(json.loads(response.text), indent=4))

http://localhost:8888/druid/indexer/v1/task/kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z/status
{
    "task": "kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z",
    "status": {
        "id": "kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z",
        "groupId": "kill_wikipedia_jbpndpif_2016-06-27T00:00:00.000Z_2016-06-28T00:00:00.000Z_2022-10-26T19:50:18.673Z",
        "type": "kill",
        "createdTime": "2022-10-26T19:50:18.692Z",
        "queueInsertionTime": "1970-01-01T00:00:00.000Z",
        "statusCode": "SUCCESS",
        "status": "SUCCESS",
        "runnerStatusCode": "WAITING",
        "duration": 29441,
        "location": {
            "host": "localhost",
            "port": 8100,
            "tlsPort": -1
        },
        "dataSource": "wikipedia",
        "errorMsg": null
    }
}


## Next steps

This tutorial covers some of the basics related to the Druid API. To learn more about the kinds of things you can do, see the API documentation:

- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)
- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)

You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.



