# Work with string data using scalar functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This tutorial demonstrates how to work with string data in Apache Druid using SQL scalar functions. In this tutorial you perform the following tasks:

- Find things in a string
- Manipulate values
- Add things to a string
- Get bits out of a string

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

## Initialization

The following cells set up the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable when it is set; otherwise assume a local Druid.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Run the following cell to create a table called `example-koalas-strings`. Notice that the query uses `TIME_PARSE` to turn the `timestamp` string into the primary timestamp `__time`, and that it partitions the table by day.

When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-koalas-strings" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "agent_category",
  "agent_type",
  "browser",
  "browser_version",
  "city",
  "continent",
  "country",
  "version",
  "event_type",
  "event_subtype",
  "loaded_image",
  "adblock_list",
  "forwarded_for",
  MV_TO_ARRAY("language") AS "language",
  "number",
  "os",
  "path",
  "platform",
  "referrer",
  "referrer_host",
  "region",
  "remote_address",
  "screen",
  "session",
  "session_length",
  "timezone",
  "timezone_offset",
  "window"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-strings')
display.table('example-koalas-strings')


### Import additional modules

Run the following cell to import additional Python modules that you will use to load query results into data frames and plot them.

In [None]:
import json
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Find things in a string

Run the next cell to find any row in the table with a Google referrer.

In [None]:
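# Assumption: a case-insensitive substring match on "referrer" identifies a Google referrer.
sql='''
SELECT "__time", "referrer", "referrer_host"
FROM "example-koalas-strings"
WHERE ICONTAINS_STRING("referrer", 'google')
LIMIT 10
'''

display.sql(sql)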
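
As a rough mental model for the search functions listed in this section, the following plain-Python sketch mimics them on an in-memory string. It does not call Druid. The helper names and the sample referrer are hypothetical. The behavior shown (1-based positions with 0 for "not found", a case-insensitive variant, and regular expressions that may match anywhere in the value) is an assumption based on the Druid SQL function reference, so check that reference before relying on edge cases.

```python
import re


def position(needle, haystack):
    """Mimic POSITION(needle IN haystack): 1-based index, 0 when absent."""
    return haystack.find(needle) + 1


def contains_string(value, needle):
    """Mimic CONTAINS_STRING: case-sensitive substring test."""
    return needle in value


def icontains_string(value, needle):
    """Mimic ICONTAINS_STRING: case-insensitive substring test."""
    return needle.lower() in value.lower()


def regexp_like(value, pattern):
    """Mimic REGEXP_LIKE: true when the pattern matches anywhere in the value."""
    return re.search(pattern, value) is not None


# Hypothetical sample value, not taken from the example data.
referrer = "https://www.Google.com/search"

print(position("Google", referrer))
print(contains_string(referrer, "google"))
print(icontains_string(referrer, "google"))
print(regexp_like(referrer, r"^https?://www\."))
```

The case-sensitive and case-insensitive helpers disagree on this sample, which is the difference to keep in mind when choosing between `CONTAINS_STRING` and `ICONTAINS_STRING`.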



POSITION / STRPOS

In [None]:
REGEXP_LIKE

In [None]:
CONTAINS_STRING

In [None]:
ICONTAINS_STRING

## Manipulate values


In [None]:
UPPER / LOWER

In [None]:
STRING_FORMAT

In [None]:
TRIM

In [None]:
REVERSE

## Add things to a string

In [None]:
CONCAT / TEXTCAT

In [None]:
REGEXP_REPLACE

In [None]:
REPLACE

In [None]:
REPEAT

In [None]:
LPAD

In [None]:
RPAD

## Get bits out of a string

The next cell uses the `POSITION`, `LEFT`, `RIGHT`, and `LENGTH` functions to split the `screen` value at the `x` separator into the horizontal and vertical screen size of the user.

In [None]:
sql='''
SELECT
  LEFT("screen",POSITION('x' in "screen")-1) AS "x-size",
  RIGHT("screen",LENGTH("screen")-POSITION('x' in "screen")) AS "y-size"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
LIMIT 10
'''

display.sql(sql)
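
To check the index arithmetic without running a query, the following plain-Python sketch applies the same steps to a single value. The function name `split_screen` and the sample values are hypothetical, and the sketch assumes the value always contains an `x` separator.

```python
def split_screen(screen):
    """Split a 'WIDTHxHEIGHT' string using the same arithmetic as the SQL above."""
    # POSITION('x' IN "screen") is 1-based, so add 1 to Python's 0-based index.
    pos = screen.find("x") + 1
    # LEFT("screen", pos - 1): everything before the separator.
    x_size = screen[:pos - 1]
    # RIGHT("screen", LENGTH("screen") - pos): the last (length - pos) characters.
    n_right = len(screen) - pos
    y_size = screen[len(screen) - n_right:]
    return x_size, y_size


# Hypothetical sample value, not taken from the example data.
print(split_screen("1920x1080"))
```

Both parts are still strings at this point, just as in the SQL query. Convert them to numbers before doing arithmetic on them.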

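Alternatively, use a regular expression. The next cell uses `REGEXP_EXTRACT` with two capture groups to extract each dimension, then averages the values by hour.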

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1H') AS "interval",
  AVG(CAST(REGEXP_EXTRACT("screen",'([0-9]*)x([0-9]*)',1) AS DOUBLE)) AS "x-size-average",
  AVG(CAST(REGEXP_EXTRACT("screen",'([0-9]*)x([0-9]*)',2) AS DOUBLE)) AS "y-size-average"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T00/PT12H')
GROUP BY 1
'''

display.sql(sql)
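
The capture-group indexes can be tried out locally with Python's `re` module, which treats this particular pattern the same way. This is only a sketch: the sample `screens` list is hypothetical, and the averaging here runs in Python rather than in Druid.

```python
import re
from statistics import mean

# Same pattern as in the query: group 1 is the width, group 2 is the height.
SCREEN_PATTERN = re.compile(r"([0-9]*)x([0-9]*)")

# Hypothetical sample values, not taken from the example data.
screens = ["1920x1080", "1280x720"]

# Extract each group as a string, then convert it to a number before averaging.
widths = [int(SCREEN_PATTERN.search(s).group(1)) for s in screens]
heights = [int(SCREEN_PATTERN.search(s).group(2)) for s in screens]

x_size_average = mean(widths)
y_size_average = mean(heights)
print(x_size_average, y_size_average)
```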


Run the following query to return only the filename from the image URL in the data.

In [None]:
# Raw string so that the regex backslashes reach Druid without invalid-escape warnings.
sql=r'''
SELECT
  REGEXP_EXTRACT("loaded_image",'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))') AS "filename",
  COUNT(*) AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
GROUP BY 1
'''

df = pd.DataFrame(sql_client.sql(sql))
df.plot.barh(x='filename', y='events')
plt.show()
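
To see what the filename pattern matches before running it against the table, you can try it with Python's `re` module. The following is a minimal sketch: `extract_filename` is a hypothetical helper and the URLs are made-up samples. It returns the whole match, on the assumption that `REGEXP_EXTRACT` without an index argument does the same.

```python
import re

# Same pattern as in the query. The lookahead requires the file extension to be
# followed by either a query string or the end of the value.
FILENAME_PATTERN = re.compile(r"[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))")


def extract_filename(url):
    """Return the filename part of a URL, or None when nothing matches."""
    match = FILENAME_PATTERN.search(url)
    return match.group(0) if match else None


# Hypothetical sample URLs, not taken from the example data.
for url in (
    "https://example.com/images/koala.jpg",
    "https://example.com/images/koala.jpg?size=large",
    "https://example.com/images/",
):
    print(url, "->", extract_filename(url))
```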

The next cell contains a SQL statement that uses a regular expression with multiple capture groups. The third argument to `REGEXP_EXTRACT` selects which group to return.

In [None]:
# Raw string so that the regex backslashes reach Druid without invalid-escape warnings.
sql=r'''
SELECT
  REGEXP_EXTRACT("loaded_image",'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?',2) AS "scheme",
  REGEXP_EXTRACT("loaded_image",'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?',5) AS "path",
  COUNT(DISTINCT "browser") AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
GROUP BY 1, 2
'''

df = pd.DataFrame(sql_client.sql(sql))
df_group=df.groupby(['path','scheme']).sum().unstack()
df_group.plot.bar(stacked=True)
plt.xticks(rotation=45, ha='right')
plt.show()
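
The pattern in this query is the URI-splitting regular expression from RFC 3986, Appendix B. The following plain-Python sketch shows which parts groups 2 and 5 capture. `scheme_and_path` is a hypothetical helper and the URL is a made-up sample.

```python
import re

# Group 2 = scheme, group 4 = authority, group 5 = path,
# group 7 = query string, group 9 = fragment.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def scheme_and_path(uri):
    """Return the (scheme, path) pair that the query extracts with groups 2 and 5."""
    match = URI_PATTERN.match(uri)
    return match.group(2), match.group(5)


# Hypothetical sample URL, not taken from the example data.
sample = "https://example.com/img/koala.jpg?size=large#top"
print(scheme_and_path(sample))
```

Because every part of the pattern is optional, it matches any input. A value with no scheme yields `None` for group 2 in Python.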

In [None]:
SUBSTRING

## Clean up

Run the following cell to remove the `example-koalas-strings` table used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-strings")

## Summary

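* You split a string column into parts with `POSITION`, `LEFT`, `RIGHT`, and `LENGTH`.
* You used `REGEXP_EXTRACT`, with and without a capture group index, to pull values out of strings before aggregating them.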

## Learn more

* Try this out on your own data
* Solve for problem X that isn't covered here
* Read docs pages
* Watch or read something cool from the community
* Do some exploratory stuff on your own
