19 changes: 19 additions & 0 deletions hive/README.md
@@ -0,0 +1,19 @@
## Apache Hive on a single Parquet file

This setup runs Apache Hive 4 inside Docker, configured with:
- HiveServer2 and an embedded Derby metastore in a single container, and
- Tez as the execution engine (the upstream default for Hive 4),

so the entire benchmark reproduces on a single VM with nothing beyond
Docker installed.

The ClickBench `hits.parquet` file stores `EventTime`, `ClientEventTime`
and `LocalEventTime` as Unix-epoch `BIGINT` values, and `EventDate` as an
`INT` count of days since 1970-01-01. `create.sql` registers the Parquet
file as an external table (`hits_raw`) and then exposes a `hits` view
that converts those columns to `TIMESTAMP` and `DATE`, so `queries.sql`
matches the canonical ClickBench query text.
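As a quick illustration of what those conversions mean, the same two mappings can be sketched with GNU `date` (the epoch values below are made up for the example; the real conversions happen in SQL inside `create.sql`):

```shell
# Hypothetical sample values, not taken from the dataset.
epoch=1372714800   # an EventTime-style value: Unix seconds
days=15900         # an EventDate-style value: days since 1970-01-01

# BIGINT seconds -> timestamp, as CAST(from_unixtime(...) AS TIMESTAMP) does
date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S'

# INT day count -> date, as date_add(DATE '1970-01-01', n) does
date -u -d "1970-01-01 +${days} days" '+%Y-%m-%d'
```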

The `results/20130923/` directory contains historical 100M-row and
10M-row results from 2013; the current run targets the standard 100M-row
ClickBench dataset.
10 changes: 10 additions & 0 deletions hive/benchmark.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# Thin shim — actual flow is in lib/benchmark-common.sh.
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
# Embedded Derby metastore lives in the container's writable layer;
# the cold-cycle docker rm + docker run in ./start wipes it. ./load
# is idempotent and reruns create.sql every cold cycle so the schema
# is present before the first try; the load wall-clock rolls into the
# cold-try timing per the standard BENCH_DURABLE=no contract.
export BENCH_DURABLE=no
exec ../lib/benchmark-common.sh
8 changes: 8 additions & 0 deletions hive/check
@@ -0,0 +1,8 @@
#!/bin/bash
set -e

# HiveServer2 exposes a JSON status endpoint on its web UI port (10002).
# /jmx works once the JVM is fully up; while the container is starting
# (or stopped) curl fails. That makes it a clean readiness signal — no
# log-tailing, no false positives across stop+start cycles.
curl -sfo /dev/null http://localhost:10002/jmx
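A driver that wants to block until this readiness probe passes can wrap it in a generic polling helper. This is a hypothetical sketch, not part of the benchmark harness; `wait_for` is a made-up name:

```shell
# Hypothetical helper: poll a command until it succeeds or the timeout
# (in seconds) expires. Against HiveServer2 this would be, e.g.:
#   wait_for 300 curl -sfo /dev/null http://localhost:10002/jmx
wait_for() {
  local timeout=$1; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # timed out without a single success
    fi
    sleep 1
  done
}
```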
149 changes: 149 additions & 0 deletions hive/create.sql
@@ -0,0 +1,149 @@
CREATE DATABASE IF NOT EXISTS clickbench;
USE clickbench;

DROP VIEW IF EXISTS hits;
DROP TABLE IF EXISTS hits_raw;

CREATE EXTERNAL TABLE hits_raw (
WatchID bigint,
JavaEnable smallint,
Title string,
GoodEvent smallint,
EventTime bigint,
EventDate int,
CounterID int,
ClientIP int,
RegionID int,
UserID bigint,
CounterClass smallint,
OS smallint,
UserAgent smallint,
URL string,
Referer string,
IsRefresh smallint,
RefererCategoryID smallint,
RefererRegionID int,
URLCategoryID smallint,
URLRegionID int,
ResolutionWidth smallint,
ResolutionHeight smallint,
ResolutionDepth smallint,
FlashMajor smallint,
FlashMinor smallint,
FlashMinor2 string,
NetMajor smallint,
NetMinor smallint,
UserAgentMajor smallint,
UserAgentMinor string,
CookieEnable smallint,
JavascriptEnable smallint,
IsMobile smallint,
MobilePhone smallint,
MobilePhoneModel string,
Params string,
IPNetworkID int,
TraficSourceID smallint,
SearchEngineID smallint,
SearchPhrase string,
AdvEngineID smallint,
IsArtifical smallint,
WindowClientWidth smallint,
WindowClientHeight smallint,
ClientTimeZone smallint,
ClientEventTime bigint,
SilverlightVersion1 smallint,
SilverlightVersion2 smallint,
SilverlightVersion3 int,
SilverlightVersion4 smallint,
PageCharset string,
CodeVersion int,
IsLink smallint,
IsDownload smallint,
IsNotBounce smallint,
FUniqID bigint,
OriginalURL string,
HID int,
IsOldCounter smallint,
IsEvent smallint,
IsParameter smallint,
DontCountHits smallint,
WithHash smallint,
HitColor string,
LocalEventTime bigint,
Age smallint,
Sex smallint,
Income smallint,
Interests smallint,
Robotness smallint,
RemoteIP int,
WindowName int,
OpenerName int,
HistoryLength smallint,
BrowserLanguage string,
BrowserCountry string,
SocialNetwork string,
SocialAction string,
HTTPError smallint,
SendTiming int,
DNSTiming int,
ConnectTiming int,
ResponseStartTiming int,
ResponseEndTiming int,
FetchTiming int,
SocialSourceNetworkID smallint,
SocialSourcePage string,
ParamPrice bigint,
ParamOrderID string,
ParamCurrency string,
ParamCurrencyID smallint,
OpenstatServiceName string,
OpenstatCampaignID string,
OpenstatAdID string,
OpenstatSourceID string,
UTMSource string,
UTMMedium string,
UTMCampaign string,
UTMContent string,
UTMTerm string,
FromTag string,
HasGCLID smallint,
RefererHash bigint,
URLHash bigint,
CLID int
)
STORED AS PARQUET
LOCATION 'file:///clickbench/hits';

-- The Parquet file stores EventTime/ClientEventTime/LocalEventTime as Unix epoch seconds (BIGINT)
-- and EventDate as days since 1970-01-01 (INT). Wrap the raw table in a view that exposes the
-- standard ClickBench types so the queries below need no further adaptation. CAST() of
-- from_unixtime() turns Hive's "yyyy-MM-dd HH:mm:ss" string into a TIMESTAMP, and
-- date_add(DATE'1970-01-01', n) yields a DATE.
CREATE VIEW hits AS
SELECT
WatchID, JavaEnable, Title, GoodEvent,
CAST(from_unixtime(EventTime) AS TIMESTAMP) AS EventTime,
date_add(DATE '1970-01-01', EventDate) AS EventDate,
CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent,
URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID,
URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight,
ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor,
UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable,
IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID,
TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical,
WindowClientWidth, WindowClientHeight, ClientTimeZone,
CAST(from_unixtime(ClientEventTime) AS TIMESTAMP) AS ClientEventTime,
SilverlightVersion1, SilverlightVersion2, SilverlightVersion3,
SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload,
IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent,
IsParameter, DontCountHits, WithHash, HitColor,
CAST(from_unixtime(LocalEventTime) AS TIMESTAMP) AS LocalEventTime,
Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName,
OpenerName, HistoryLength, BrowserLanguage, BrowserCountry,
SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming,
ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming,
SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID,
ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID,
OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign,
UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID
FROM hits_raw;
5 changes: 5 additions & 0 deletions hive/data-size
@@ -0,0 +1,5 @@
#!/bin/bash
set -e

# External Parquet table — report the source file size.
stat -c %s data/hits/hits.parquet
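For reference, `stat -c %s` is GNU coreutils syntax and prints the file size as a plain byte count, which is exactly the single number the driver expects on stdout. A tiny self-contained sanity sketch (temp file, hypothetical content):

```shell
# `stat -c %s` prints a file's size in bytes -- the same call ./data-size
# makes against the staged Parquet file.
tmp=$(mktemp)
printf '12345' > "$tmp"   # 5 bytes
stat -c %s "$tmp"
rm -f "$tmp"
```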
24 changes: 24 additions & 0 deletions hive/install
@@ -0,0 +1,24 @@
#!/bin/bash
set -e

HIVE_VERSION=4.0.1

# Hive's official image bundles its own JRE; only Docker is needed.
if ! command -v docker >/dev/null 2>&1; then
sudo apt-get update -y
sudo apt-get install -y docker.io
fi
sudo apt-get install -y curl

sudo docker pull apache/hive:${HIVE_VERSION}

# Hive's external-table LOCATION points at /clickbench/hits inside the
# container; that path is the bind mount target for ./data on the host.
# Create the directory now so ./start can mount it before the first
# ./load.
mkdir -p data/hits
# apache/hive runs as uid 1000 ("hive") and writes the embedded Derby
# metastore + warehouse dirs under /opt/hive; the container also reads
# /clickbench/hits to list its external table. Make sure that uid can
# both read and write the bind-mount even when cloud-init runs as root.
sudo chown -R 1000:1000 data
37 changes: 37 additions & 0 deletions hive/load
@@ -0,0 +1,37 @@
#!/bin/bash
set -e

# Stage hits.parquet under data/hits/ — that dir is the Hive external
# table's LOCATION via the /clickbench bind mount inside the container.
#
# Idempotent: BENCH_DURABLE=no triggers ./load again on every cold
# cycle, but the dataset is 14 GB and re-staging it every cycle would
# blow up the run time without changing the measurement. The first
# invocation moves hits.parquet (delivered into cwd by
# download-hits-parquet-single) into data/hits/; subsequent invocations
# find no source file and reuse the staged copy.
if [ -f hits.parquet ]; then
mkdir -p data/hits
mv -f hits.parquet data/hits/hits.parquet
fi
sudo chown -R 1000:1000 data

# Run create.sql via beeline inside the container. -n hive matches the
# default container user so the external LOCATION is readable; --silent
# suppresses beeline's prompt/timing chrome which would otherwise leak
# into ./load's stdout and confuse the driver's load-time parser.
sudo docker cp create.sql hive:/tmp/create.sql
# `< /dev/null` is load-bearing: bench_main runs ./load inside
# `while read query; do ... done < queries.sql`, so our inherited
# stdin IS the queries.sql fd. With -i, docker exec forwards host
# stdin into the container until EOF; beeline (running with -f) never
# reads it, so docker silently drains queries.sql while waiting for
# beeline to consume nothing. The next bench_main read then hits EOF
# and the whole query loop exits after Q1, with no error message
# (Q1's [t,t,t] is the only timing in the log, then it jumps straight
# to data-size). Redirecting stdin from /dev/null isolates this
# docker call from the surrounding loop's input.
sudo docker exec -i hive beeline -u 'jdbc:hive2://localhost:10000/' -n hive \
--silent=true -f /tmp/create.sql < /dev/null

sync
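The stdin-draining hazard described in the comment above is easy to reproduce without Docker: plain `cat` stands in for `docker exec -i`, since both forward their inherited stdin until EOF. A minimal sketch (file path and contents are made up):

```shell
# Reproduce the hazard: a command inside `while read` that inherits the
# loop's stdin silently drains the remaining lines.
printf 'q1\nq2\nq3\n' > /tmp/fake_queries.txt

broken=0
while read -r q; do
  broken=$((broken + 1))
  cat > /dev/null               # inherits and drains the loop's stdin
done < /tmp/fake_queries.txt    # broken ends up at 1: only q1 is seen

fixed=0
while read -r q; do
  fixed=$((fixed + 1))
  cat > /dev/null < /dev/null   # stdin isolated; the loop sees every line
done < /tmp/fake_queries.txt    # fixed ends up at 3

rm -f /tmp/fake_queries.txt
```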
43 changes: 43 additions & 0 deletions hive/queries.sql
@@ -0,0 +1,43 @@
SELECT COUNT(*) FROM hits;
SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0;
SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
SELECT AVG(UserID) FROM hits;
SELECT COUNT(DISTINCT UserID) FROM hits;
SELECT COUNT(DISTINCT SearchPhrase) FROM hits;
SELECT MIN(EventDate), MAX(EventDate) FROM hits;
SELECT AdvEngineID, COUNT(*) FROM hits WHERE AdvEngineID <> 0 GROUP BY AdvEngineID ORDER BY COUNT(*) DESC;
SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10;
SELECT RegionID, SUM(AdvEngineID), COUNT(*) AS c, AVG(ResolutionWidth), COUNT(DISTINCT UserID) FROM hits GROUP BY RegionID ORDER BY c DESC LIMIT 10;
SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT UserID, COUNT(*) FROM hits GROUP BY UserID ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase LIMIT 10;
SELECT UserID, EXTRACT(MINUTE FROM EventTime) AS m, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, EXTRACT(MINUTE FROM EventTime), SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID FROM hits WHERE UserID = 435090932899640449;
SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY SearchPhrase LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime, SearchPhrase LIMIT 10;
SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\\.)?([^/]+)/.*$', '$1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY REGEXP_REPLACE(Referer, '^https?://(?:www\\.)?([^/]+)/.*$', '$1') HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(ResolutionWidth + 18), SUM(ResolutionWidth + 19), SUM(ResolutionWidth + 20), SUM(ResolutionWidth + 21), SUM(ResolutionWidth + 22), SUM(ResolutionWidth + 23), SUM(ResolutionWidth + 24), SUM(ResolutionWidth + 25), SUM(ResolutionWidth + 26), SUM(ResolutionWidth + 27), SUM(ResolutionWidth + 28), SUM(ResolutionWidth + 29), SUM(ResolutionWidth + 30), SUM(ResolutionWidth + 31), SUM(ResolutionWidth + 32), SUM(ResolutionWidth + 33), SUM(ResolutionWidth + 34), SUM(ResolutionWidth + 35), SUM(ResolutionWidth + 36), SUM(ResolutionWidth + 37), SUM(ResolutionWidth + 38), SUM(ResolutionWidth + 39), SUM(ResolutionWidth + 40), SUM(ResolutionWidth + 41), SUM(ResolutionWidth + 42), SUM(ResolutionWidth + 43), SUM(ResolutionWidth + 44), SUM(ResolutionWidth + 45), SUM(ResolutionWidth + 46), SUM(ResolutionWidth + 47), SUM(ResolutionWidth + 48), SUM(ResolutionWidth + 49), SUM(ResolutionWidth + 50), SUM(ResolutionWidth + 51), SUM(ResolutionWidth + 52), SUM(ResolutionWidth + 53), SUM(ResolutionWidth + 54), SUM(ResolutionWidth + 55), SUM(ResolutionWidth + 56), SUM(ResolutionWidth + 57), SUM(ResolutionWidth + 58), SUM(ResolutionWidth + 59), SUM(ResolutionWidth + 60), SUM(ResolutionWidth + 61), SUM(ResolutionWidth + 62), SUM(ResolutionWidth + 63), SUM(ResolutionWidth + 64), SUM(ResolutionWidth + 65), SUM(ResolutionWidth + 66), SUM(ResolutionWidth + 67), SUM(ResolutionWidth + 68), SUM(ResolutionWidth + 69), SUM(ResolutionWidth + 70), SUM(ResolutionWidth + 71), SUM(ResolutionWidth + 72), SUM(ResolutionWidth + 73), SUM(ResolutionWidth + 74), SUM(ResolutionWidth + 75), SUM(ResolutionWidth + 76), SUM(ResolutionWidth + 77), SUM(ResolutionWidth + 78), SUM(ResolutionWidth + 79), SUM(ResolutionWidth + 80), SUM(ResolutionWidth + 81), SUM(ResolutionWidth + 82), SUM(ResolutionWidth + 83), SUM(ResolutionWidth + 84), SUM(ResolutionWidth + 85), SUM(ResolutionWidth + 86), SUM(ResolutionWidth + 87), SUM(ResolutionWidth + 88), SUM(ResolutionWidth + 89) FROM hits;
SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
SELECT 1, URL, COUNT(*) AS c FROM hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;
SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
SELECT Title, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> '' GROUP BY Title ORDER BY PageViews DESC LIMIT 10;
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND IsRefresh = 0 AND IsLink <> 0 AND IsDownload = 0 GROUP BY URL ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END, URL ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-01' AND EventDate <= DATE '2013-07-31' AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
SELECT FLOOR_MINUTE(EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= DATE '2013-07-14' AND EventDate <= DATE '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY FLOOR_MINUTE(EventTime) ORDER BY FLOOR_MINUTE(EventTime) LIMIT 10 OFFSET 1000;
22 changes: 22 additions & 0 deletions hive/query
@@ -0,0 +1,22 @@
#!/bin/bash
# Reads a SQL query from stdin, runs it via beeline against HiveServer2
# in the running container.
# Stdout: query result.
# Stderr: query runtime in fractional seconds on the last line.
# Exit non-zero on error.
set -e

query=$(cat)

start=$(date +%s.%N)
# `< /dev/null`: see hive/load for the long version. bench_run_query
# pipes the query in via `printf | ./query`, so the printf pipe is
# already drained by `query=$(cat)` above and our stdin is at EOF
# here — but make the docker-exec stdin source explicit so this
# script stays safe if anyone ever calls it without the printf-pipe
# wrapping.
sudo docker exec -i hive beeline -u 'jdbc:hive2://localhost:10000/clickbench' -n hive \
--silent=true --outputformat=tsv2 -e "$query" < /dev/null
end=$(date +%s.%N)

awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }' >&2
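The start/end/awk pattern above generalizes to timing any command. A hypothetical wrapper (name and structure are illustrative, not part of the harness) that keeps the same contract — result on stdout, fractional seconds on stderr:

```shell
# Hypothetical wrapper, same pattern as ./query: run a command, pass its
# stdout through untouched, report wall-clock seconds on stderr.
time_cmd() {
  local start end
  start=$(date +%s.%N)
  "$@"
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }' >&2
}
```

For example, `time_cmd sleep 0.2` prints nothing on stdout and a value a little above `0.200` on stderr.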