
Returned data inconsistent with provided "opentsdb.interval" option #4

Open
asavartsov opened this issue Mar 24, 2017 · 6 comments
asavartsov commented Mar 24, 2017

I'm trying to read a specific range of data with spark-opentsdb by providing the "opentsdb.interval" option, but the returned data doesn't exactly match the interval.

The code I use to insert data:

```scala
import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

OpenTSDBContext.autoCreateMetrics = true
OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val csv = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///adr.csv")

// One DataPoint per CSV row: metric "test", millisecond timestamp, value, one tag
val data = csv.map(row =>
  DataPoint("test", row.getAs[Timestamp]("Time").getTime, row.getAs[Double]("Value"), Map("tag" -> "value")))

data.rdd.toDF(spark).write.mode("append").opentsdb
```

adr.csv contains data from 2017-02-02T09:20:00.000Z (12:20:00 in my timezone, GMT+3) to 2017-02-02T10:20:00.000Z (13:20:00 in my timezone).

The code I use to read data:

```scala
import org.apache.spark.sql.functions._
import java.sql.Timestamp
import com.cgnal.spark.opentsdb._

import spark.sqlContext.implicits._

OpenTSDBContext.saltWidth = 1
OpenTSDBContext.saltBuckets = 4

val readFrom = new Timestamp(1486026162527L)
val readTo = new Timestamp(1486030255027L)

// "opentsdb.interval" takes epoch seconds formatted as "from:to"
val interval = s"${readFrom.getTime / 1000}:${readTo.getTime / 1000}"

val adr = spark.sqlContext
  .read
  .options(Map("opentsdb.metric" -> "test", "opentsdb.interval" -> interval))
  .opentsdb
  .orderBy($"timestamp".asc)

z.show(adr)
```

Results:

[screenshots of the returned query results omitted]

I'm trying to read data from around 12:02 local time, but the results start from 13:00. If I tweak the from/to values I get different ranges, but they seem essentially random. Omitting the interval option returns all the data.

I'm running the code on Spark 2.1.0, Hadoop 2.6.0-cdh5.10.0, HBase 1.2.0-cdh5.10.0, and OpenTSDB 2.3.0 in yarn-client mode from a Zeppelin notebook; running in local mode in the Spark shell gives the same results.

asavartsov (Author)

adr.zip

My source data file; all values are at millisecond resolution.

dgreco (Contributor) commented Mar 24, 2017 via email

asavartsov (Author)

I've discovered a pattern: the returned data interval starts at the 'from' boundary rounded up to the next hour, except when 'from' is set exactly to the top of an hour (00m00s000ms). For example, if I set 'from' to 12:00:00, everything is OK; if I set it to 12:00:01, I get values starting at the 13:00:00 timestamp. The 'to' boundary is handled correctly. Similarly, on my sample dataset I can request data from 13:00:00 to 13:00:01, but not in the range 13:00:01 to 13:00:02.
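This pattern is consistent with OpenTSDB's storage layout, where data points are grouped into HBase rows keyed by an hour-aligned base timestamp, so a 'from' bound that isn't exactly on an hour boundary effectively gets rounded up to the next row. A minimal sketch of the two rounding modes involved (my own illustration, not spark-opentsdb or OpenTSDB code):

```scala
// Illustration of hour-boundary rounding on epoch-millisecond timestamps.
object HourRounding {
  val HourMs: Long = 3600L * 1000L

  // Round up to the next hour, unless the timestamp is already exactly
  // on an hour boundary -- the behaviour observed for the 'from' bound.
  def ceilToHour(ts: Long): Long =
    if (ts % HourMs == 0) ts else (ts / HourMs + 1) * HourMs

  // Round down to the start of the containing hour -- what a caller
  // would actually want for the 'from' bound.
  def floorToHour(ts: Long): Long = ts - ts % HourMs
}
```

For example, 2017-02-02T10:00:00.000Z (1486029600000L) is unchanged by `ceilToHour`, while 10:00:01 jumps to 11:00:00, matching the 12:00:01 → 13:00:00 behaviour described above.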

dgreco (Contributor) commented Mar 24, 2017 via email

asavartsov (Author)

Thanks for pointing out the cause of the problem. It turns out the issue can be fixed quite easily on the library side; see the referenced pull request.
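The pull request itself isn't captured in this thread, but a library-side fix of the kind described could plausibly floor a non-aligned 'from' bound to its hour boundary before scanning, then filter the returned points back to the caller's exact range. A hypothetical sketch (`alignedInterval` and `clip` are my own names, not spark-opentsdb API):

```scala
// Hypothetical sketch: widen the requested interval so the scan covers the
// whole containing hour, then clip results back to the exact bounds.
object IntervalFix {
  val HourSec: Long = 3600L

  // Epoch-second bounds, as used in the "opentsdb.interval" option ("from:to").
  def alignedInterval(fromSec: Long, toSec: Long): String = {
    val alignedFrom = fromSec - fromSec % HourSec // floor 'from' to its hour
    s"$alignedFrom:$toSec"
  }

  // After the scan, drop any points outside the caller's original range.
  def clip(points: Seq[(Long, Double)], fromSec: Long, toSec: Long): Seq[(Long, Double)] =
    points.filter { case (ts, _) => ts >= fromSec && ts <= toSec }
}
```

With the bounds from the read example above, `alignedInterval(1486026162L, 1486030255L)` yields `"1486026000:1486030255"`, so the scan starts at the 09:00:00Z row instead of skipping ahead to 10:00:00Z.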

dgreco (Contributor) commented Mar 24, 2017

I merged your pull request into the spark-2.x branch.
