
Run pipelines on LUH cluster #62

Open
eichelbe opened this issue Jan 18, 2017 · 173 comments

@eichelbe
Collaborator

eichelbe commented Jan 18, 2017

For testing and as a backup plan:
(1) Run RandomPip on LUH cluster (SUH)
(2) Replace Storm with Adaptive Storm* and change the worker configuration for larger pipelines (SUH)
(3) Run RandomPip on LUH cluster again (SUH)
(4) Run TransferPip on LUH cluster (LUH)
(5) Run FocusPip on LUH cluster (LUH)
(6) For TSI: Run Time Travel Pipeline for startup time debugging on LUH cluster (SUH + TSI)
(7) Run TransferPip + FocusPip on LUH cluster
(8) Test QM-IConf connection into LUH cluster (requires SSH tunneling)
(9) Test Application connection into LUH cluster (requires SSH tunneling)

Please document the process below!
After (3) we will communicate how to work with the cluster on the project Wiki!

*required for adaptation and for identifying timing issues (extension over Storm, can be further extended if needed)

@eichelbe
Collaborator Author

eichelbe commented Jan 18, 2017

Cui is working on (1)

  • QM infrastructure setup seems to be OK; however, workers are restarting even on RandomPip.
  • Seems to be a Storm serialization problem: the local Storm app folders were not cleared. Need to restart all workers.
  • Needed to replace Storm (Adaptive Storm is now installed) as well as the infrastructure installation.

@eichelbe
Collaborator Author

-> RandomPip is running
(1)+(2)+(3) are done

@smutahoang

@eichelbe @cuiqin That's great! Thanks for your effort. We look forward to the wiki page on how to run the pipelines on the LUH cluster so that we can proceed with the next steps.

@eichelbe
Collaborator Author

;) It's also in our interest....

And here is the extended LUH cluster wiki page (the lower parts are mostly "dangerous"...)

@ap0n
Collaborator

ap0n commented Jan 18, 2017

Great!
So, how do we continue? Do we get accounts on the LUH cluster, or do we coordinate with SUH and test together (mainly interested in (6) 👼)?

@eichelbe
Collaborator Author

Not sure, this is LUH's decision. We had to sign some papers saying that we lose all our money if we download their data ;) Just kidding; I think we will take your pipeline and run it there for you, reporting on the issues that we see.

@cuiqin
Collaborator

cuiqin commented Jan 18, 2017

Adaptive switch-over on RandomPip is working.

@eichelbe
Collaborator Author

As a follow-up: the LUH infrastructure is now working with coordination.commandCompletion.onEvent = true. This will become active with the next restart.

@eichelbe
Collaborator Author

For (4)+(5), the financial data source may need an adjustment to read from the file system rather than HDFS (not available in the LUH cluster). Either there is a way to convince Miroslav, or we have to modify the TSI source. Initial discussions with Christoph regarding the Twitter source (works with the file system) are ongoing to synchronize the work and the approach...

@ap0n
Collaborator

ap0n commented Jan 19, 2017

But couldn't the Okeanos HDFS also be used? (It will be slower because of the internet, but it could be an alternative, at least for testing...)

@L3SQualimaster

L3SQualimaster commented Jan 19, 2017 via email

@eichelbe
Collaborator Author

eichelbe commented Jan 19, 2017

Cool. From our meeting I had in mind that we do not have access to HDFS. Then let's keep things simple: put the data under /data/storm in HDFS and configure the infrastructure accordingly. I will tell TSI...

@ap0n
Collaborator

ap0n commented Jan 19, 2017

Just to let you know, today I spotted another attack on our HDFS; this time a data ransom was included! According to this, these attacks are quite widespread.

@eichelbe
Collaborator Author

Too bad. As the LUH cluster is behind an external server, the risk there should be lower.

@L3SQualimaster any news on the FocusPip?

@eichelbe
Collaborator Author

The infrastructure setup has been changed to use the LUH HDFS with base path /data/storm, as suggested by Miroslav (thanks again). The respective setup information is passed to the workers by the infrastructure via the DML configuration class (getHdfsUrl(), getHdfsPath() as a postfix to the HDFS URL, getSimulationLocalPath(), useSimulationHdfs(), as well as getDfsPath()). An infrastructure update and a restart of the infrastructure are needed.

It's now up to the sources to take up this information and to copy the required data into the HDFS. What is the time plan there?
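A minimal sketch of how a source could take up these settings, assuming the accessors above are static methods on the DML configuration class (DataManagementConfiguration) and assuming a plain Hadoop copy for staging; the signatures and the copy step are assumptions, not the actual implementation:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import eu.qualimaster.dataManagement.DataManagementConfiguration;

public class SimulationDataStager {

    // Returns the path a source shall read from; copies the local file into
    // HDFS (under the configured base path, e.g. /data/storm) if HDFS is enabled.
    public static String stage(String fileName) throws Exception {
        if (DataManagementConfiguration.useSimulationHdfs()) { // accessor from above
            FileSystem fs = FileSystem.get(
                URI.create(DataManagementConfiguration.getHdfsUrl()),
                new Configuration());
            Path target = new Path(DataManagementConfiguration.getHdfsPath(), fileName);
            fs.copyFromLocalFile(
                new Path(DataManagementConfiguration.getSimulationLocalPath(), fileName),
                target);
            return target.toString();
        }
        // local file system fallback, as discussed for the LUH cluster
        return DataManagementConfiguration.getSimulationLocalPath() + "/" + fileName;
    }
}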

@ap0n
Collaborator

ap0n commented Jan 23, 2017

The SpringClientSimulator is modified to use these settings (committed about half an hour ago).

@eichelbe
Collaborator Author

Fine :)

@eichelbe
Collaborator Author

BTW, the HDFS user/group setup is not done, so the infrastructure may complain about that while extracting the setup files for Stefan. But as far as I know, Tuan is trying to find a way for the stakeholder applications to access the pipeline results; then we can also figure out whether extracting the setup files is needed in the LUH setup.

@eichelbe
Collaborator Author

... the installation needs a further classpath entry for HDFS. Discussing with Miroslav...

@ChristophHubeL3S

ChristophHubeL3S commented Jan 24, 2017

Hi everyone, I got an error while trying to start the infrastructure on hadoop2:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:97)
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:82)
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.clearFolder(HdfsUtils.java:232)
at eu.qualimaster.coordination.RepositoryConnector.readModels(RepositoryConnector.java:537)
at eu.qualimaster.coordination.RepositoryConnector.initialize(RepositoryConnector.java:441)
at eu.qualimaster.coordination.RepositoryConnector.(RepositoryConnector.java:330)
at eu.qualimaster.coordination.CoordinationManager.start(CoordinationManager.java:417)
at eu.qualimaster.adaptation.platform.Main.startupPlatform(Main.java:72)
at eu.qualimaster.adaptation.platform.Main.main(Main.java:117)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more

(I tried to run ./main.sh)

@eichelbe
Collaborator Author

That's exactly what my last post was about; for this, I need a change done by root. And please do not use main.sh on your cluster, as the infrastructure is installed there as a service. I'll let you know as soon as the classpath entry is clarified with Miroslav (or you may call him directly ;))

@eichelbe
Collaborator Author

eichelbe commented Jan 24, 2017

Ok, fixed thanks to Miroslav. Updated the Wiki in this respect (main.sh).

Started PriorityPip. It comes up and reports a java.lang.NullPointerException at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49), called from eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTuple.

Here is the full trace
node14: java.lang.NullPointerException: null
node14: at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49) ~[stormjar.jar:na]
node14: at eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTuple(PriorityPip_Source0Source.java:148) ~[stormjar.jar:na]
node14: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node14: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node14: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node14: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]

hdfs.path and hdfs.url are set up. The simulation settings as discussed in #59 are not set so far. Are they needed?
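For reference, a sketch of how these settings could be inspected. This assumes they live as plain key/value pairs in the qm.infrastructure.cfg file mentioned later in this thread; that file format is an assumption, and this is not the infrastructure's actual loader:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class CfgPeek {
    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        // path as reported in the infrastructure logs further below
        try (FileReader in = new FileReader("/var/nfs/qm/qm.infrastructure.cfg")) {
            cfg.load(in);
        }
        System.out.println("hdfs.url  = " + cfg.getProperty("hdfs.url"));
        System.out.println("hdfs.path = " + cfg.getProperty("hdfs.path"));
    }
}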

@ChristophHubeL3S

Thanks for fixing! The FocusPip starts now, but immediately crashes because of the FinancialSource, which is not a surprise given that the financial data is not yet reachable.

Btw, I would not use the PriorityPip for testing since it is very old and I think nobody is maintaining it.

@eichelbe
Collaborator Author

Ok (see trace above). How do we set up the financial simulation source? @ap0n

@ChristophHubeL3S

Apostolos: we can add the financial dataset here: /local/home/storm/datasets

Then we just have to adapt the Source.

@antoine-tran

FYI, we just killed the TransferPip pipeline.

@antoine-tran

Btw, how can we see the status of a pipeline's lifecycle? :)

@eichelbe
Collaborator Author

Just through the infrastructure logs :o

@eichelbe
Collaborator Author

We did not tweak the Storm UI... Alternatively, through QM-IConf (adaptation/runtime log), but for that we also need to have the adaptation port available ;)

@eichelbe
Collaborator Author

... As long as the infrastructure logs "Elevating to"..., the respective phase is not reached. For both critical phases (CREATED and INITIALIZED), it also logs which nodes are expected and which are missing.

CREATED means that all nodes have sent an initial event via the event bus and are ready to receive the initial algorithm assignment. If exceptions happen here, reaching CREATED can easily be prevented. The adaptation then issues commands/signals to set the initial algorithms.

INITIALIZED means that all nodes have received and confirmed their initial algorithm and are ready for processing.

If INITIALIZED is reached, the infrastructure connects the data sinks and sources and processing starts, i.e., it switches the pipeline to STARTED.
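As a compact summary, the phases described above form a simple progression (names taken from the text; this enum is purely illustrative, not the infrastructure's actual type):

public enum PipelinePhase {
    CREATED,      // all nodes sent their initial event via the event bus and can
                  // receive the initial algorithm assignment
    INITIALIZED,  // all nodes received and confirmed their initial algorithm
    STARTED       // sources/sinks connected, processing is running
}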

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Hi, what about the running FocusPip? I would like to test the load shedding (just enabled on PriorityFinancialPip) and need an infrastructure restart (local model)...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

And from the FocusPip:

node19: 2017-02-01T12:40:26.879+0100 b.s.util [ERROR] Async loop died! (PipelineVar_7_Source1)
node19: java.lang.NullPointerException: null
node19: at eu.qualimaster.focus.FocusedSpringClientSimulator.getSpringStream(FocusedSpringClientSimulator.java:63) ~[stormjar.jar:na]
node19: at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:143) ~[stormjar.jar:na]
node19: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
node19: 2017-02-01T12:40:26.879+0100 b.s.d.executor [ERROR]
node19: java.lang.NullPointerException: null
node19: at eu.qualimaster.focus.FocusedSpringClientSimulator.getSpringStream(FocusedSpringClientSimulator.java:63) ~[stormjar.jar:na]
node19: at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:143) ~[stormjar.jar:na]
node19: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
node19: 2017-02-01T12:40:26.915+0100 b.s.util [ERROR] Halting process: ("Worker with id " "c1c4a871-f049-400f-9004-52e23454ec5e" " died")
node19: java.lang.RuntimeException: ("Worker with id " "c1c4a871-f049-400f-9004-52e23454ec5e" " died")
node19: at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:335) [storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.RestFn.invoke(RestFn.java:460) [clojure-1.5.1.jar:na]
node19: at backtype.storm.daemon.worker$fn__6460$fn__6461.invoke(worker.clj:669) [storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.daemon.executor$mk_executor_data$fn__5726$fn__5727.invoke(executor.clj:261) [storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:488) [storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]

This happens after some "Got playerList parameter. New value: addmarketPlayer/0,1,2,3,4,5..." messages.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Killing the FocusPip...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Changing financial data set to overload scenario 2...

@antoine-tran

@eichelbe I'm testing the HBase connection and needed to change and recompile the DataManagementLayer and manually copy it into the provided_libs directory on the LUH cluster.

Do I have to re-run main.sh to reload the library? It seems the code was not updated when I restarted the pipelines...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

It is definitely not updated via the pipelines, as it is not included in them (it is a provided library). As stated above, running main.sh is not the right way on your cluster, as its execution is controlled by a service manager.

Ok, Jenkins is done. I'll do the update and the distribution and let you know...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

What about the running TransferPip?

@antoine-tran

No, I didn't commit it yet. I just thought we could hot-test it quickly via manual copies.
You can delete the running pipeline.

@antoine-tran

Okay, just committed it now.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Then it will take a while until we can update the infrastructure; I did not know this. Otherwise, a patch and overriding the jar after a Maven build would have worked...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Jenkins is ready. Distributing libraries and restarting infrastructure on LUH cluster.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Model loaded, infrastructure is up.

@antoine-tran

So, I hard-coded the HBase configuration instead of getting the values from DataManagementConfiguration, just to see whether the problem is about ZooKeeper or not. The good news is that we didn't get the "Connection refused" error, so as long as qm.infrastructure.cfg manages to be visible to the pipeline, I guess we are good to go.

For the moment, we have the following error:

2017-02-01T18:55:59.683+0100 e.q.Configuration [ERROR] While reading configuration file /var/nfs/qm/qm.infrastructure.cfg: /var/nfs/qm/qm.infrastructure.cfg (No such file or directory)

Not sure how this file is distributed to the infrastructure ...
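For context, hard-coding the HBase/ZooKeeper settings typically looks like the sketch below; the host and port are placeholders, not the cluster's actual values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HardcodedHBase {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host.example");   // placeholder host
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // default ZooKeeper port
        return conf;
    }
}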

@eichelbe
Collaborator Author

eichelbe commented Feb 2, 2017

Which component is emitting that error?

@antoine-tran

I actually don't know which component is causing this; my guess is somewhere in the EventManager, which tries to locate qm.infrastructure.cfg (I'm not deeply familiar with this part, actually).

Here is an excerpt from the log when I ran FocusPip (so no ReplayMechanism around):

2017-02-02T10:55:51.916+0100 e.q.F.t.PipelineVar_7_Source1Source [INFO] Prepared--basesignalspout.... FocusPip/PipelineVar_7_Source1
2017-02-02T10:55:51.924+0100 e.q.e.EventManager [INFO] received
2017-02-02T10:55:51.925+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.infrastructure.PipelineLifecycleEvent
2017-02-02T10:55:51.926+0100 e.q.e.EventManager [INFO] sending
2017-02-02T10:55:51.947+0100 e.q.e.EventManager [INFO] sending ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.infrastructure.PipelineLifecycleEvent
2017-02-02T10:55:51.958+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ReferenceDataManagementEvent
2017-02-02T10:55:51.959+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ShutdownEvent
2017-02-02T10:55:51.968+0100 e.q.Configuration [ERROR] While reading configuration file /var/nfs/qm/qm.infrastructure.cfg: /var/nfs/qm/qm.infrastructure.cfg (No such file or directory)
2017-02-02T10:55:51.971+0100 e.q.e.EventManager [INFO] sending ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ReferenceDataManagementEvent
2017-02-02T10:55:51.989+0100 e.q.a.i.c.SpringClientSimulator [INFO] Using local FS for simulation data
2017-02-02T10:55:51.990+0100 e.q.a.i.c.SpringClientSimulator [INFO] Path to Symbollist.txt: /home/storm/data//Symbollist.txt
2017-02-02T10:55:51.992+0100 e.q.a.i.c.SpringClientSimulator [INFO] Path to data.txt: /home/storm/data//data.txt

@eichelbe
Collaborator Author

eichelbe commented Feb 2, 2017

The bolts and spouts initialize the event manager as needed, and this happens, based on Storm conf information, before the shown trace; they do not rely on that path. So something else may be calling that; as far as I remember, some TSI components do this with a fixed path as a fallback, but this is just a guess. Of course, we could emit the stack trace to find the cause, but this may be as confusing as the sinks indicating with a stack trace that the external service is not reachable.

More importantly: do you get the correct HBase setup information from the DML configuration without trying to initialize it manually, and when do you try to get this information? Please note that the correct information is available earliest after the super.prepare(...) call, not in any constructor (which is the wrong place for heavy-weight initialization anyway).
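In Storm terms, that means reading the configuration no earlier than prepare/open. A sketch, where the base class name and what gets initialized are assumptions; only the super.prepare(...) timing is the point:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;

public class HBaseWritingBolt extends BaseSignalBolt { // base class assumed

    private transient org.apache.hadoop.conf.Configuration hbaseConf;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        super.prepare(conf, context, collector); // DML configuration is reliable from here on
        // safe place for heavy-weight initialization, e.g. the HBase connection
        hbaseConf = org.apache.hadoop.hbase.HBaseConfiguration.create();
    }
}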

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

I will do some re-parallelization tests on the LUH cluster now. This may cause strange effects on the correlation computation in other pipelines. It's probably better not to run further tests at the moment, probably until lunch. I'll let you know.

@antoine-tran

I see that the cluster is free now. Could we run some tests?

In addition, we are constantly making changes in the DML now to spot the issues (just commenting/un-commenting some pieces of code to inspect). How can we restart the platform conveniently instead of emailing around?

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

Partially. Try running your tests now; the effects should now be rather local to PrioFinPip.

If you take the responsibility for fixing problems from an update until the infrastructure is running again, please feel free to read the wiki and use the respective scripts.

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

Is the TransferPip still needed? Can I go on?

@antoine-tran

Hi Holger, yes, you can kill the pipelines and proceed now. We are investigating locally now.

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Hi, is anyone using the LUH cluster right now? I would like to update/restart it and give the TimeTravelPip another try (I just have an idea where the strange exception could come from).

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Update/restart done...

@antoine-tran

@eichelbe Hi Holger, could you tell us where to see the infrastructure log of the currently running TransferPip?
Also, how can one restart the infrastructure using the same account (storm) on hadoop2?

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Hi Tuan,
there is only a single infrastructure log (log-rotated) at /var/log/storm/qm-infra/current. If required, the commands for restarting the infrastructure are QM_INFRA_DOWN and then QM_INFRA_UP (see Wiki). Please observe the infrastructure log until the infrastructure is completely started!
