Records not extracting from specific topics (partitions?) #6
I had another look into the HDFS file system on the server and I am pretty sure that the missing data is actually in HDFS. Also, I did not really find any suspicious log entries from that time in the docker-compose logs.
To restart the extraction for a portion of the data, you can indeed manipulate offsets.csv. Just remove all lines of the acceleration topic (or just the offsets that gave an exception) and run the script again. Could you try to convert one of the files that gave a BlockMissingException to JSON with avro-tools?
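For reference, a minimal sketch of both steps. The HDFS path assumes the connector's default `/topics` output directory, the file names and avro-tools version are placeholders, and the grep assumes each line of offsets.csv contains the topic name:

```sh
# Paths, file names and the avro-tools version are assumptions -- adjust to the deployment.

# 1. Copy a suspect file out of HDFS and convert it to JSON:
hdfs dfs -get /topics/android_empatica_e4_acceleration/partition=0/suspect-file.avro .
java -jar avro-tools-1.8.2.jar tojson suspect-file.avro > suspect-file.json

# 2. Drop the committed offsets for the acceleration topic so the extraction
#    script re-reads that topic on its next run:
grep -v android_empatica_e4_acceleration offsets.csv > offsets.tmp \
  && mv offsets.tmp offsets.csv
```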
Edit: I think this is a separate issue; I submitted it here: RADAR-base/RADAR-HDFS-Sink-Connector#4
So I restarted the container and got a bunch of exceptions, ending again with that one: log
OK, I am now thoroughly confused. I had another, closer look at the files in HDFS, specifically the newest ones for the acceleration topic (since there are now new ones after I restarted the container), converted them to JSON as above, and noticed that the timestamps were in the past, from later that night, after the supposedly missing data starts. See below a newly created Avro file from today, the last JSON record in that file, and the timestamp of that record:
Same thing for one of the files I mentioned in an earlier comment:
So to me it currently looks like the connector is only slowly catching up on the data now, almost as if the data were being collected right now. Is it normal that it takes the connector this long to go through Kafka and store the data?
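For completeness, the timestamp check described above can be reproduced roughly like this (the file name and the epoch value are placeholders):

```sh
# Convert the newest Avro file to JSON and look at its last record:
java -jar avro-tools-1.8.2.jar tojson newest-file.avro | tail -n 1

# If the record's time field is in epoch seconds, convert it to a readable date (GNU date):
date -u -d @1500000000
```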
Yes, Kafka will leave all data in its log for a while, to allow consumers to catch up if they get behind. If they fall too far behind, though, Kafka will remove the data altogether. The default time for this is 168 hours = 1 week (see the Apache Kafka configuration docs, `log.retention.hours`).
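The broker-level default is `log.retention.hours=168`; retention can also be raised for a single topic at runtime, roughly as below (host and value are placeholders, and newer Kafka versions take `--bootstrap-server` instead of `--zookeeper`):

```sh
# Raise retention for one topic to 14 days (1209600000 ms):
kafka-configs --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name android_empatica_e4_acceleration \
  --add-config retention.ms=1209600000
```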
Since the retention is set to 1 week, I need to get this data very soon then, because the missing section of data is from almost a week ago. Or is there a way to change the retention time after the fact? This morning a bit more data has been written, but still not everything that is missing. Instead the connector now gives exceptions like this, even after restarting the container:
and like this:
It looks like this is the same thing that gave errors before: empty files. See confluentinc/kafka-connect-hdfs#53. To restart, they removed the empty files that were giving errors.
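A rough way to spot and clear such files, assuming the connector writes under `/topics` (the file name in the second command is a placeholder):

```sh
# List zero-length files under the connector's output directory:
hdfs dfs -ls -R /topics | awk '$1 !~ /^d/ && $5 == 0 {print $8}'

# After double-checking, remove an offending empty file so the connector can continue:
hdfs dfs -rm /topics/android_empatica_e4_acceleration/partition=0/empty-file.avro
```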
Ah yes, I remember...
I didn't do anything except ...
So, being optimistic and assuming that the missing data is now being completely recovered, what would be possible ways to make sure that the connector will not lag behind like this again? It is really quite slow at the moment, processing maybe 1-2 HDFS files per second, while not being load intensive at all.
And being more than one day behind after only two days of recording from 3+ sources will not be feasible in the long run, IMO.
Solved by increasing the Kafka HDFS connector flush size.
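For anyone hitting the same slowdown: the relevant setting is the sink connector's `flush.size` (records per output file). A sketch of the change, assuming the stack configures the connector through a `sink-hdfs.properties` file; the value 80000 is illustrative, not the one used here:

```sh
# Larger flush.size -> fewer, larger HDFS files and fewer offset commits:
sed -i 's/^flush.size=.*/flush.size=80000/' sink-hdfs.properties
grep '^flush.size' sink-hdfs.properties   # verify the new value
```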
I am having a weird case of data loss at the moment concerning the extraction tool. Using one patient as an example, here is what I am seeing:
Using the tool, I get CSV files for the topic `android_empatica_e4_acceleration` (showing just the last 3). The problem is that there is missing data from the following night. The recording lasts until the morning after, which is also reflected in other Empatica topics, e.g. `android_empatica_e4_electrodermal_activity`. This is currently the case for the acceleration and BVP topics.
I know that the missing data at least went through Kafka, since the aggregated data for that timeframe is in the hot storage, as shown by the data from the REST API call get_samples_json for that particular recording. (Note the date/time x-axis is CEST, not UTC.)
Something I noticed in my log files from the tool: I have several `BlockMissingException`s from those two topics, although I have no idea if that could be the source of the problem.
Maybe a particular partition, which contains the missing data, is not being extracted?
I am using @fnobilia's script from RADAR-Docker, and have DEBUG enabled in log4j.properties.
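The DEBUG logging mentioned above amounts to one line in the script's log4j.properties; a sketch, assuming the default appender is called `stdout`:

```sh
# Switch the root logger to DEBUG (appender name is an assumption):
sed -i 's/^log4j.rootLogger=.*/log4j.rootLogger=DEBUG, stdout/' log4j.properties
grep rootLogger log4j.properties
```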