
Records not extracting from specific topics (partitions?) #6

Closed
sboettcher opened this issue Aug 5, 2017 · 12 comments


@sboettcher (Member)

I am having a weird case of data loss at the moment concerning the extraction tool. Using one patient as an example, here is what I am seeing:
Using the tool, I get CSV files for the topic android_empatica_e4_acceleration (showing just the last three):

...
-rw-r--r--  1 root root 9,9M 05.08.2017 12:47 20170803_1600.csv
-rw-r--r--  1 root root 9,9M 05.08.2017 12:57 20170803_1700.csv
-rw-r--r--  1 root root 2,7M 05.08.2017 13:07 20170803_1800.csv

The problem is that data from the following night is missing. The recording lasts until the morning after, which is also reflected in other Empatica topics, e.g. android_empatica_e4_electrodermal_activity:

...
-rw-r--r--  1 root root  1,1M 04.08.2017 19:42 20170804_0400.csv
-rw-r--r--  1 root root  1,1M 04.08.2017 19:42 20170804_0500.csv
-rw-r--r--  1 root root  769K 04.08.2017 19:42 20170804_0600.csv

This is currently the case for the acceleration (acc) and blood volume pulse (BVP) topics.

I know that the missing data at least went through Kafka, since the aggregated data for that time frame is in the hot storage:

[Screenshot from 2017-08-05 13:56:28: data from the REST API call get_samples_json for that particular recording. Note the date/time x-axis is CEST, not UTC.]

Something I noticed in my log files from the tool: there are several BlockMissingExceptions for those two topics:

$ grep -Hin blockmissingexception restructure_log*
restructure_log_01-08-2017_10-59-03:853493:2017-08-01 08:59:01 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_01-08-2017_10-59-03:853494:2017-08-01 08:59:01 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_01-08-2017_10-59-03:853496:org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro
restructure_log_01-08-2017_10-59-03:853520:Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073809137_68313 file=/topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro
restructure_log_02-08-2017_10-45-24:911759:2017-08-02 08:45:18 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_02-08-2017_10-45-24:911760:2017-08-02 08:45:18 WARN  DFSClient:981 - Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro No live nodes contain current block Block locations: DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK] DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] Dead nodes:  DatanodeInfoWithStorage[172.18.0.4:50010,DS-f79e7ed4-3ae9-4767-9c53-0f30b6018781,DISK] DatanodeInfoWithStorage[172.18.0.2:50010,DS-026efe85-2e06-4c85-9ee3-6db75ee16e86,DISK]. Throwing a BlockMissingException
restructure_log_02-08-2017_10-45-24:911762:org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro
restructure_log_02-08-2017_10-45-24:911786:Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1157731097-172.18.0.3-1501484738256:blk_1073938426_197602 file=/topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0006341124+0006341273.avro

I have no idea whether that could be the source of the problem, though. Maybe a particular partition, which contains the missing data, is not being extracted?

I am using @fnobilia's script from RADAR-Docker, and have DEBUG enabled in log4j.properties.

@sboettcher (Member Author)

I had another look into the HDFS file system on the server and I am pretty sure that the missing data is actually in HDFS. With hdfs dfs -ls I found .avro files in two partitions that were written during that night; I picked one of the files, looked into it with hdfs dfs -cat, and I can see the patient ID from the specific recording where data is missing (the actual data is binary, obviously).
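
For reference, the inspection was roughly the following (a sketch: the path is taken from the log excerpts above, and the file name is just a placeholder):

# list the Avro files the sink connector wrote for this topic partition
hdfs dfs -ls /topicAndroidNew/android_empatica_e4_acceleration/partition=2/

# dump one of them; the payload is binary Avro, but the userId/sourceId strings are readable
hdfs dfs -cat /topicAndroidNew/android_empatica_e4_acceleration/partition=2/<some-file>.avro | head -c 2000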
Is there a way to restart the extraction from scratch for just one topic/ID? Maybe by manipulating bins.csv and/or offsets.csv?
(I suppose just deleting the whole topics folder would work too, but then everything would have to be extracted again.)

Also, I did not really find any suspicious log entries from that time in the docker-compose logs for hdfs-datanode-1, hdfs-datanode-2, hdfs-namenode, or radar-hdfs-connector.

@MaximMoinat (Collaborator)

To restart the extraction for a portion of the data, you can indeed manipulate offsets.csv. Just remove all lines for the acceleration topic (or just the offsets that gave an exception) and run the script again.
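
A minimal sketch of that edit, assuming offsets.csv sits in the extraction output directory and contains the topic name somewhere in each of its lines (back the file up first; everything beyond the file name itself is an assumption):

# keep a backup, then drop every offsets.csv line that mentions the acceleration topic
cp offsets.csv offsets.csv.bak
grep -v 'android_empatica_e4_acceleration' offsets.csv.bak > offsets.csv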

Could you try converting one of the files that gave a BlockMissingException to JSON with avro-tools?

hadoop fs -get /topicAndroidNew/android_empatica_e4_blood_volume_pulse/partition=2/android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro

java -jar /usr/share/avro-tools/avro-tools-1.7.7.jar tojson android_empatica_e4_blood_volume_pulse+2+0005159476+0005159625.avro

@sboettcher (Member Author)

tojson works without an error message on that file.
I will now try to rerun the extraction for just that topic.

@sboettcher (Member Author)

sboettcher commented Aug 8, 2017

Edit: I think this is a separate issue; I submitted it here: RADAR-base/RADAR-HDFS-Sink-Connector#4
Hm, so I saw that radar-hdfs-connector had now stopped with an exception:

radar-hdfs-connector_1     | 2017-08-05T12:54:01.133005441Z [2017-08-05 12:54:01,132] ERROR Task radar-hdfs-sink-android-15000-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133085075Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133092941Z   at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133100828Z   at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133108550Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133115345Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133121594Z   at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133129616Z   at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133137065Z   at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133143866Z   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133155380Z   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133161875Z   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133169393Z   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133177736Z   at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-05T12:54:01.133184102Z [2017-08-05 12:54:01,132] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)

So I restarted the container and got a bunch of exceptions, ending again with that one:


radar-hdfs-connector_1     | 2017-08-08T10:58:17.658855180Z Exception in thread "kafka-coordinator-heartbeat-thread | connect-radar-hdfs-sink-android-15000" Exception in thread "LeaseRenewer:root@hdfs-namenode:8020" Exception in thread "kafka-coordinator-heartbeat-thread | connect-radar-hdfs-sink-android-15000" java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660211421Z 	at java.util.HashMap$EntrySet.iterator(HashMap.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660228131Z 	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.trySend(ConsumerNetworkClient.java:386)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660237971Z 	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:223)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660330193Z 	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.660342490Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:865)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661818388Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661917082Z [2017-08-08 10:58:17,659] ERROR Task radar-hdfs-sink-android-15000-3 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.661930189Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662053284Z 	at java.util.ArrayList.iterator(ArrayList.java:834)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662067477Z 	at org.apache.hadoop.hdfs.LeaseRenewer.clientsRunning(LeaseRenewer.java:241)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662076405Z 	at org.apache.hadoop.hdfs.LeaseRenewer.access$500(LeaseRenewer.java:71)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662530537Z [2017-08-08 10:58:17,660] ERROR Task radar-hdfs-sink-android-15000-2 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662543689Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662551642Z 	at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662558950Z 	at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662566144Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662573083Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662580007Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662587186Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662593930Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662600773Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662607643Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662614597Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662622157Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.662640133Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663100448Z [2017-08-08 10:58:17,660] WARN Unexpected error reading responses on connection Thread[IPC Client (248329120) connection to hdfs-namenode/172.18.0.3:8020 from root,5,main] (org.apache.hadoop.ipc.Client)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663114053Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663122000Z 	at java.lang.Integer.valueOf(Integer.java:832)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663129185Z 	at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663136544Z 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663143873Z 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663151310Z 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663158981Z 	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663166127Z 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663173755Z 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663181039Z 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663188887Z 	at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663196501Z 	at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663203539Z 	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663210703Z 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663217825Z 	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663224860Z 	at java.io.DataInputStream.readInt(DataInputStream.java:387)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663232252Z 	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663239095Z 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663245740Z [2017-08-08 10:58:17,662] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.663252919Z [2017-08-08 10:58:17,661] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664754302Z 	at org.apache.hadoop.hdfs.LeaseRenewer$Factory.remove(LeaseRenewer.java:154)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664857528Z 	at org.apache.hadoop.hdfs.LeaseRenewer$Factory.access$800(LeaseRenewer.java:90)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664878000Z 	at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:312)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.664966845Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.665451043Z java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674685502Z [2017-08-08 10:58:17,672] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674757976Z org.apache.kafka.connect.errors.ConnectException: java.io.IOException: Failed on local exception: java.io.IOException: Error reading responses; Host Details : local host is: "6d3d8843d9b9/172.18.0.5"; destination host is: "hdfs-namenode":8020; 
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674775466Z 	at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674786782Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674797842Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674809804Z 	at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674818974Z 	at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674828847Z 	at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674838684Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674848146Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674904030Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674915370Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674925666Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674935125Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674944293Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674953437Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674962779Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674972112Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.674987616Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675016059Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675031307Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675040607Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675049323Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675057743Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675082478Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675091712Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675100571Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675115445Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675481141Z Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Error reading responses; Host Details : local host is: "6d3d8843d9b9/172.18.0.5"; destination host is: "hdfs-namenode":8020; 
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675522787Z 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675531882Z 	at org.apache.hadoop.ipc.Client.call(Client.java:1473)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675540120Z 	at org.apache.hadoop.ipc.Client.call(Client.java:1400)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675548463Z 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675557739Z 	at com.sun.proxy.$Proxy47.getFileInfo(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675566079Z 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675574527Z 	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675582693Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675590784Z 	at java.lang.reflect.Method.invoke(Method.java:498)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675598809Z 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675607116Z 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675616695Z 	at com.sun.proxy.$Proxy48.getFileInfo(Unknown Source)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675624462Z 	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675632361Z 	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675640551Z 	at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675648690Z 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675669801Z 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675678825Z 	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675687171Z 	at io.confluent.connect.hdfs.storage.HdfsStorage.exists(HdfsStorage.java:66)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675695703Z 	at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675703914Z 	... 25 more
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675733556Z Caused by: java.io.IOException: Error reading responses
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675742644Z 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:974)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675750558Z Caused by: java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675758148Z 	at java.lang.Integer.valueOf(Integer.java:832)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675766979Z 	at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:120)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675826511Z 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:98)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675846692Z 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675854309Z 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675861362Z 	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675868688Z 	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675875810Z 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675882983Z 	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675890033Z 	at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675896759Z 	at java.io.FilterInputStream.read(FilterInputStream.java:133)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675912188Z 	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675919586Z 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675926531Z 	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675933227Z 	at java.io.DataInputStream.readInt(DataInputStream.java:387)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675940313Z 	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675947123Z 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.675953636Z [2017-08-08 10:58:17,674] INFO Started recovery for topic partition application_uptime-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:17.759939024Z [2017-08-08 10:58:17,759] INFO Discovered coordinator kafka-2:9092 (id: 2147483645 rack: null) for group connect-radar-hdfs-sink-android-15000. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
radar-hdfs-connector_1     | 2017-08-08T10:58:18.790960667Z [2017-08-08 10:58:18,790] INFO Successfully acquired lease for hdfs://hdfs-namenode:8020/logs/application_uptime/0/log (io.confluent.connect.hdfs.wal.FSWAL)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222620804Z [2017-08-08 10:58:21,222] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222711059Z java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222761069Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222770081Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222778601Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.222786062Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.223887738Z [2017-08-08 10:58:21,222] ERROR Recovery failed at state WAL_APPLIED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224044328Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224213471Z 	at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224240225Z 	at io.confluent.connect.hdfs.wal.FSWAL.truncate(FSWAL.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224315470Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.truncateWAL(TopicPartitionWriter.java:490)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224339287Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:215)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224405665Z 	at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224429416Z 	at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224496283Z 	at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224517864Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224668545Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224702481Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224825771Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224863913Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.224968463Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225007628Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225115954Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225144781Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225238780Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225334044Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225385164Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225481965Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225539996Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225635045Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225662942Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225752911Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225781015Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225889033Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.225918844Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226067886Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226090089Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226098765Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226105605Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226112853Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.226119549Z [2017-08-08 10:58:21,223] INFO Started recovery for topic partition android_biovotion_battery_state-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.227944280Z [2017-08-08 10:58:21,227] INFO Finished recovery for topic partition android_biovotion_battery_state-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.227960824Z [2017-08-08 10:58:21,227] INFO Started recovery for topic partition android_biovotion_spo2-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.242479443Z [2017-08-08 10:58:21,242] INFO Finished recovery for topic partition android_biovotion_spo2-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:21.242519511Z [2017-08-08 10:58:21,242] INFO Started recovery for topic partition android_empatica_e4_blood_volume_pulse-0 (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594051287Z [2017-08-08 10:58:43,593] ERROR Task radar-hdfs-sink-android-15000-1 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594178139Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594321199Z 	at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594369256Z 	at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594435121Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594458350Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594512221Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594533581Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594585291Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594606048Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594667725Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594689178Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594744402Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594765654Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:58:43.594815858Z [2017-08-08 10:58:43,593] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.680815652Z Exception in thread "LeaseRenewer:root@hdfs-namenode:8020" java.lang.OutOfMemoryError: Java heap space
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684064012Z [2017-08-08 10:59:09,683] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684121728Z java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.2:50010, 172.18.0.4:50010], original=[172.18.0.2:50010, 172.18.0.4:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684134622Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684142125Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684149445Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.684156588Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685040766Z [2017-08-08 10:59:09,684] ERROR Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log. (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685060542Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685069086Z 	at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685096253Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:372)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685104128Z 	at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685114217Z 	at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685121639Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685128947Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685135976Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685143131Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685150382Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685156655Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685163099Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685169246Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685175843Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685182428Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685188941Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.2:50010, 172.18.0.4:50010], original=[172.18.0.2:50010, 172.18.0.4:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685197143Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685203794Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685211285Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685218128Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685886612Z [2017-08-08 10:59:09,685] ERROR Error closing writer for android_empatica_e4_acceleration-0. Error: Error closing writer: Error closing hdfs://hdfs-namenode:8020/logs/android_empatica_e4_acceleration/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.685900880Z  (io.confluent.connect.hdfs.DataWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686436108Z [2017-08-08 10:59:09,685] ERROR Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log. (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686449148Z org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686457044Z 	at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:161)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686470374Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:372)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686477672Z 	at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686484204Z 	at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686490698Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686497253Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686503670Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686510252Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686516660Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686523122Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686529473Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686535836Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686542115Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686548634Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686554892Z Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.18.0.4:50010, 172.18.0.2:50010], original=[172.18.0.4:50010, 172.18.0.2:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686563037Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686569484Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686577152Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686584046Z 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686590655Z [2017-08-08 10:59:09,686] ERROR Error closing writer for application_uptime-0. Error: Error closing writer: Error closing hdfs://hdfs-namenode:8020/logs/application_uptime/0/log
radar-hdfs-connector_1     | 2017-08-08T10:59:09.686597558Z  (io.confluent.connect.hdfs.DataWriter)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687585226Z [2017-08-08 10:59:09,687] ERROR Task radar-hdfs-sink-android-15000-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687599056Z java.lang.NullPointerException
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687606712Z 	at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687613465Z 	at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:121)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687626278Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:302)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687633619Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:435)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687640233Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:147)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687646783Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687653293Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687659713Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687666231Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687672577Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687679296Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687685641Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-08T10:59:09.687691950Z [2017-08-08 10:59:09,687] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
radar-hdfs-connector_1     | 2017-08-08T10:59:10.190611936Z [2017-08-08 10:59:10,190] INFO Reflections took 103976 ms to scan 265 urls, producing 12687 keys and 83909 values  (org.reflections.Reflections)
This looks to me like a memory issue. Could this be related, or is it a different issue?
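
If it really is heap exhaustion, one thing that might help is raising the connector's JVM heap (a sketch, assuming the connector image honours the standard KAFKA_HEAP_OPTS variable like the stock Kafka scripts do; the 4g value is just an example):

# in docker-compose.yml, under the radar-hdfs-connector service, add e.g.:
#   environment:
#     KAFKA_HEAP_OPTS: "-Xms1g -Xmx4g"
# then recreate just that service
docker-compose up -d --force-recreate radar-hdfs-connector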

@sboettcher (Member Author)

OK, I am now thoroughly confused. I had another, closer look at the files in HDFS, specifically the newest ones for the acceleration topic (there are new ones now, after I restarted the container), converted them to JSON as above, and noticed that the timestamps are in the past, actually from later that night, after the point where the supposedly missing data starts.

See below a newly created Avro file from today, the last JSON record in that file, and the timestamp of that record:

hdfs: -rw-r--r--   3 root supergroup       9500 2017-08-08 16:06 /topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0021130772+0021130921.avro
json: {"key":{"userId":"UKLFRB2","sourceId":"00:07:80:F2:1A:43"},"value":{"time":1.5017942087017004E9,"timeReceived":1.501794206716E9,"x":-0.8125,"y":-0.59375,"z":-0.03125}}
timeReceived: 08/03/2017 @ 9:03pm (UTC)

Same thing for one of the files I mentioned in an earlier #6 (comment):

hdfs: -rw-r--r--   3 root supergroup       9500 2017-08-04 01:25 /topicAndroidNew/android_empatica_e4_acceleration/partition=2/android_empatica_e4_acceleration+2+0014425524+0014425673.avro
json: {"key":{"userId":"UKLFRB2","sourceId":"00:07:80:F2:1A:43"},"value":{"time":1.501706715380344E9,"timeReceived":1.501706712389E9,"x":-0.015625,"y":-0.328125,"z":0.921875}}
timeReceived: 08/02/2017 @ 8:45pm (UTC)
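
(For reference, the timeReceived epoch seconds convert to a date like this; a minimal sketch using GNU date:)

date -u -d @1501794206   # -> Thu Aug  3 21:03:26 UTC 2017
date -u -d @1501706712   # -> Wed Aug  2 20:45:12 UTC 2017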

So to me it currently looks like the connector is slowly catching up on the data, almost as if the data were only now being collected. Is it normal for the connector to take this long to work through Kafka and store the data?
I have now disconnected all Empaticas again and the connector is still committing data. I will see tomorrow whether it has finished.
My working theory is that somewhere along the line there was a problem with one of the partitions, which was now "fixed" by restarting the connector, and it is now trying to catch up...
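
A way to check that theory would be to look at the sink's consumer lag per partition (a sketch: the broker service name kafka-1 is an assumption, the consumer group name is taken from the connector logs above, and depending on the Kafka version an extra --new-consumer flag may be needed):

docker-compose exec kafka-1 kafka-consumer-groups --bootstrap-server kafka-1:9092 --describe --group connect-radar-hdfs-sink-android-15000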

@blootsvoets (Member)

blootsvoets commented Aug 9, 2017

Yes, Kafka will leave all data in its log for a while, to allow consumers to catch up if they get behind. If they fall too far behind, though, Kafka will remove the data altogether. The default time for this is 168 hours = 1 week (see the Apache Kafka configuration docs, log.retention.<x>). So if a connector falls structurally behind (by more than a week), you may lose data.
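
If needed, retention can also be changed per topic while the data is still there (a sketch: the ZooKeeper address and broker service name are assumptions, and 1209600000 ms corresponds to 14 days):

docker-compose exec kafka-1 kafka-configs --zookeeper zookeeper-1:2181 --entity-type topics --entity-name android_empatica_e4_acceleration --alter --add-config retention.ms=1209600000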

@sboettcher (Member Author)

sboettcher commented Aug 9, 2017

Since the retention is set to one week, I need to get this data very soon then, because the missing section of data is from almost a week ago. Or is there a way to change the retention time after the fact?

So this morning a bit more data had been written, but still not everything that is missing. Instead, the connector now gives exceptions like this, even after restarting the container:

radar-hdfs-connector_1     | 2017-08-09T08:05:06.103581674Z [2017-08-09 08:05:06,102] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103679462Z org.apache.kafka.connect.errors.ConnectException: Cannot acquire lease after timeout, will retry.
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103697513Z 	at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:95)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103707956Z 	at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103716755Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103725341Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103734395Z 	at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103743716Z 	at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103753421Z 	at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103762031Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103770955Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103779925Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103789981Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.103799897Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104201839Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104237809Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104316058Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104330866Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104341432Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104350723Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104359762Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104369218Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104381668Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104391196Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104400040Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104409157Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104417605Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104434756Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-09T08:05:06.104444413Z 	at java.lang.Thread.run(Thread.java:745)
and like this:

radar-hdfs-connector_1     | 2017-08-09T08:05:39.180107105Z [2017-08-09 08:05:39,179] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180164855Z org.apache.kafka.connect.errors.ConnectException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to APPEND_FILE /logs/application_record_counts/1/log for DFSClient_NONMAPREDUCE_-1505132056_27 on 172.18.0.5 because lease recovery is in progress. Try again later.
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180202772Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2918)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180213965Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2683)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180223592Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2982)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180253706Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2950)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180263546Z 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:654)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180272330Z 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180281153Z 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180299276Z 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180309263Z 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180328643Z 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180338174Z 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180346589Z 	at java.security.AccessController.doPrivileged(Native Method)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180355016Z 	at javax.security.auth.Subject.doAs(Subject.java:422)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180363329Z 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180371870Z 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180380057Z 
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180387965Z 	at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:88)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180420971Z 	at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:105)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180430898Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:484)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180443378Z 	at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:212)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180453374Z 	at io.confluent.connect.hdfs.DataWriter.recover(DataWriter.java:239)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180462566Z 	at io.confluent.connect.hdfs.DataWriter.open(DataWriter.java:281)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180482966Z 	at io.confluent.connect.hdfs.HdfsSinkTask.open(HdfsSinkTask.java:116)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180492953Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:431)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180501740Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:55)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180510326Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:467)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180526073Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180555500Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180566039Z 	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180574784Z 	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180583469Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180592160Z 	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180600623Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:317)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180609158Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:235)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180617246Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:172)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180635710Z 	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:143)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180644812Z 	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180653678Z 	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180661850Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180670274Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180678266Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180689594Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180707444Z 	at java.lang.Thread.run(Thread.java:745)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180718316Z Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to APPEND_FILE /logs/application_record_counts/1/log for DFSClient_NONMAPREDUCE_-1505132056_27 on 172.18.0.5 because lease recovery is in progress. Try again later.
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180728378Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2918)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180736994Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2683)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180745568Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2982)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180754778Z 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2950)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180763072Z 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:654)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180771494Z 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180796772Z 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180806612Z 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180815206Z 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180823400Z 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180831590Z 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180839768Z 	at java.security.AccessController.doPrivileged(Native Method)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180847622Z 	at javax.security.auth.Subject.doAs(Subject.java:422)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180877268Z 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180886388Z 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180894530Z 
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180902300Z 	at org.apache.hadoop.ipc.Client.call(Client.java:1469)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180910624Z 	at org.apache.hadoop.ipc.Client.call(Client.java:1400)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180918632Z 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180937099Z 	at com.sun.proxy.$Proxy47.append(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180946318Z 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:313)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180954961Z 	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180964615Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180973801Z 	at java.lang.reflect.Method.invoke(Method.java:498)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180982270Z 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180991015Z 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.180999277Z 	at com.sun.proxy.$Proxy48.append(Unknown Source)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181018375Z 	at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1756)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181026889Z 	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1792)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181034905Z 	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1785)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181042849Z 	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:323)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181051198Z 	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:319)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181059184Z 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181067251Z 	at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:319)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181097242Z 	at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1173)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181107244Z 	at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:221)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181115797Z 	at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:67)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181124107Z 	at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:73)
radar-hdfs-connector_1     | 2017-08-09T08:05:39.181132291Z 	... 26 more

@blootsvoets
Member

It looks like this is the same thing that gave errors before: empty files. See confluentinc/kafka-connect-hdfs#53. To get the connector running again, they removed the empty files that were causing the errors.
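If this happens again, here is a minimal sketch of the clean-up (the /logs paths come from the trace above; double-check the listing before deleting anything, and the recoverLease line is only an alternative, not something that was run here):

# list zero-byte files under the connector's WAL/log directory ($5 is the size column of hadoop fs -ls)
hadoop fs -ls -R /logs | awk '$5 == 0 {print $8}'
# remove an offending empty log file, then restart the connector container
hadoop fs -rm /logs/application_record_counts/1/log
# alternative: ask HDFS to close the dangling lease on the file instead of deleting it
hdfs debug recoverLease -path /logs/application_record_counts/1/log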

@sboettcher
Member Author

ah yes I remember...

@sboettcher
Member Author

sboettcher commented Aug 9, 2017

I didn't do anything except run hadoop fs -ls /logs/*/*, which showed that no log files existed at all for the relevant topics (not even empty ones), and now it is committing again... maybe the container restart helped anyway.

@sboettcher
Member Author

So, being optimistic and assuming the missing data is now being fully recovered: what options are there to make sure the connector does not lag behind like this again? It is really quite slow at the moment, processing maybe 1-2 HDFS files per second, while not being load-intensive at all.

root@bd4fef958418:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  4.7 27.8 14484272 6877400 ?    Ssl  08:02  16:02 java -Xms256m -Xmx8g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/connect-log4j.properties -cp /etc/kafka-connect/jars/*:/usr/share/java/confluent-common/*:/usr/share/java/kafka-serde-tools/*:/usr/share/java/monitoring-interceptors/*:/usr/share/java/kafka-connect-elasticsearch/*:/usr/share/java/kafka-connect-hdfs/*:/usr/share/java/kafka-connect-jdbc/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-support-metrics/*:/usr/share/java/confluent-support-metrics/* org.apache.kafka.connect.cli.ConnectStandalone /etc/kafka-connect/kafka-connect.properties /etc/kafka-connect/sink-hdfs.properties
root     19060  0.0  0.0  21964  3604 pts/0    Ss   13:39   0:00 /bin/bash
root     19357  0.0  0.0  19184  2308 pts/0    R+   13:41   0:00 ps aux

And being more than a day behind after only two days of recording from 3+ sources will not be feasible in the long run IMO.
Could the bottleneck be somewhere else? I can't see any other container under heavy load in cAdvisor, though.
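One way to see whether the connector is actually falling behind on the broker side is to check its consumer group lag; a sketch, assuming the default connect-<connector-name> group naming and a broker reachable as kafka-1:9092 (both the group name and the address below are placeholders):

kafka-consumer-groups --bootstrap-server kafka-1:9092 --describe --group connect-radar-hdfs-sink

The per-partition LAG column should show whether only specific partitions (e.g. the acceleration and BVP ones) are trailing. On older Kafka CLI versions the extra --new-consumer flag may be needed.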

@blootsvoets
Member

Solved by increasing the Kafka HDFS connector flush size.
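For reference, the change boils down to a larger flush.size (optionally combined with a time-based rotate.interval.ms) in the HDFS sink configuration, so the connector commits fewer, larger Avro files per topic-partition. A sketch of the relevant excerpt of /etc/kafka-connect/sink-hdfs.properties; the concrete values below are illustrative, not the ones that were deployed:

# excerpt of sink-hdfs.properties (values are illustrative)
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics.dir=/topicAndroidNew
# commit a file only once this many records have accumulated for a topic-partition...
flush.size=16000
# ...or once this interval has elapsed, whichever comes first
rotate.interval.ms=900000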
