
Returned data inconsistent with provided "opentsdb.interval" option #4

Open
asavartsov opened this issue Mar 24, 2017 · 6 comments
asavartsov commented Mar 24, 2017

I'm trying to read a specific range of data with spark-opentsdb by providing the "opentsdb.interval" option, but the returned data doesn't exactly match the interval.

The code I use to insert data:

```scala
import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

OpenTSDBContext.autoCreateMetrics = true
OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val csv = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///adr.csv")

// One DataPoint per CSV row: metric "test", millisecond timestamp, value, one tag
val data = csv.map(row =>
  DataPoint("test", row.getAs[Timestamp]("Time").getTime, row.getAs[Double]("Value"), Map("tag" -> "value")))

data.rdd.toDF(spark).write.mode("append").opentsdb
```

adr.csv contains data from 2017-02-02T09:20:00.000Z (12:20:00 in my timezone, GMT+3) to 2017-02-02T10:20:00.000Z (13:20:00 in my timezone).

The code I use to read data:

```scala
import org.apache.spark.sql.functions._
import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

import spark.sqlContext.implicits._

OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val readFrom = new Timestamp(1486026162527L)
val readTo = new Timestamp(1486030255027L)

// "opentsdb.interval" takes epoch seconds formatted as "from:to"
val interval = s"${readFrom.getTime / 1000}:${readTo.getTime / 1000}"

val adr = spark.sqlContext
  .read
  .options(Map("opentsdb.metric" -> "test", "opentsdb.interval" -> interval))
  .opentsdb
  .orderBy($"timestamp".asc)

z.show(adr)
```

Results:

[screenshots of the returned query results omitted]

I'm trying to read data from around 12:02 local time, but the results start from 13:00. If I tweak the from/to values I get different ranges, but they seem essentially random. Omitting the interval option returns all the data.

I'm running the code on Spark 2.1.0, Hadoop 2.6.0-cdh5.10.0, HBase 1.2.0-cdh5.10.0, and OpenTSDB 2.3.0 in yarn-client mode from a Zeppelin notebook; running in local mode in the Spark shell gives the same results.

asavartsov (Author)

adr.zip

My source data file; all values are at millisecond resolution.

dgreco (Contributor) commented Mar 24, 2017 via email

asavartsov (Author)

I've discovered a pattern: the returned data interval starts at the 'from' boundary rounded up to the next hour, except when 'from' is set exactly to the top of an hour (00m00s000ms). For example, if I set 'from' to 12:00:00, everything is OK; if I set it to 12:00:01, I get values starting at the 13:00:00 timestamp. The 'to' boundary is handled correctly. Similarly, on my sample dataset I can request data from 13:00:00 to 13:00:01, but not in the range 13:00:01 to 13:00:02.
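This pattern is consistent with OpenTSDB's storage layout, where data points are grouped into HBase rows keyed by an hour-aligned base timestamp, so a 'from' bound that isn't exactly on an hour boundary effectively gets rounded up to the next row. A minimal sketch of the two rounding modes involved (my own illustration, not spark-opentsdb or OpenTSDB code):

```scala
// Illustration of hour-boundary rounding on epoch-millisecond timestamps.
object HourRounding {
  val HourMs: Long = 3600L * 1000L

  // Round up to the next hour, unless the timestamp is already exactly
  // on an hour boundary -- the behaviour observed for the 'from' bound.
  def ceilToHour(ts: Long): Long =
    if (ts % HourMs == 0) ts else (ts / HourMs + 1) * HourMs

  // Round down to the start of the containing hour -- what a caller
  // would actually want for the 'from' bound.
  def floorToHour(ts: Long): Long = ts - ts % HourMs
}
```

For example, 2017-02-02T10:00:00.000Z (1486029600000L) is unchanged by `ceilToHour`, while 10:00:01 jumps to 11:00:00, matching the 12:00:01 → 13:00:00 behaviour described above.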

dgreco (Contributor) commented Mar 24, 2017 via email

asavartsov (Author)

Thanks for pointing out the cause of the problem. It turns out the issue can be fixed quite easily on the library side; see the referenced pull request.
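The pull request itself isn't captured in this thread, but a library-side fix of the kind described could plausibly floor a non-aligned 'from' bound to its hour boundary before scanning, then filter the returned points back to the caller's exact range. A hypothetical sketch (`alignedInterval` and `clip` are my own names, not spark-opentsdb API):

```scala
// Hypothetical sketch: widen the requested interval so the scan covers the
// whole containing hour, then clip results back to the exact bounds.
object IntervalFix {
  val HourSec: Long = 3600L

  // Epoch-second bounds, as used in the "opentsdb.interval" option ("from:to").
  def alignedInterval(fromSec: Long, toSec: Long): String = {
    val alignedFrom = fromSec - fromSec % HourSec // floor 'from' to its hour
    s"$alignedFrom:$toSec"
  }

  // After the scan, drop any points outside the caller's original range.
  def clip(points: Seq[(Long, Double)], fromSec: Long, toSec: Long): Seq[(Long, Double)] =
    points.filter { case (ts, _) => ts >= fromSec && ts <= toSec }
}
```

With the bounds from the read example above, `alignedInterval(1486026162L, 1486030255L)` yields `"1486026000:1486030255"`, so the scan starts at the 09:00:00Z row instead of skipping ahead to 10:00:00Z.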

dgreco (Contributor) commented Mar 24, 2017

I merged your pull request into the spark-2.x branch.
