
Run pipelines on LUH cluster #62

Open
eichelbe opened this issue Jan 18, 2017 · 173 comments

@eichelbe
Collaborator

eichelbe commented Jan 18, 2017

For testing and as a backup plan:
(1) Run RandomPip on LUH cluster (SUH)
(2) Replace Storm with Adaptive Storm* and change the worker configuration for larger pipelines (SUH)
(3) Run RandomPip on LUH cluster again (SUH)
(4) Run TransferPip on LUH cluster (LUH)
(5) Run FocusPip on LUH cluster (LUH)
(6) For TSI: Run Time Travel Pipeline for startup time debugging on LUH cluster (SUH + TSI)
(7) Run TransferPip + FocusPip on LUH cluster
(8) Test QM-IConf connection into LUH cluster (requires SSH tunneling)
(9) Test Application connection into LUH cluster (requires SSH tunneling)

Please document the process below!
After (3) we will communicate how to work with the cluster on the project Wiki!

*required for adaptation and for identifying timing issues (extension over Storm, can be further extended if needed)

@eichelbe
Collaborator Author

eichelbe commented Jan 18, 2017

Cui is working on (1)

  • QM infrastructure setup seems to be OK; however, workers are restarting even on RandomPip.
  • Seems to be a Storm serialization problem: the local Storm app folders were not cleared. Need to restart all workers.
  • Needed to replace Storm (Adaptive Storm is now installed) as well as the infrastructure installation.

@eichelbe
Collaborator Author

-> RandomPip is running
(1)+(2)+(3) are done

@smutahoang

@eichelbe @cuiqin That's great! Thanks for your effort. We look forward to the wiki page on how to run the pipelines on the LUH cluster so that we can proceed with the next steps.

@eichelbe
Collaborator Author

;) It's also in our interest....

And here is the extended LUH cluster wiki page (the lower parts are mostly "dangerous"...)

@ap0n
Collaborator

ap0n commented Jan 18, 2017

Great!
So, how do we continue? Do we get accounts on the LUH cluster, or do we coordinate with SUH and test together (mainly interested in (6) 👼)?

@eichelbe
Collaborator Author

Not sure, this is LUH's decision. We had to sign some papers saying that we lose all our money if we download their data ;) Just kidding; I think we will take your pipeline and run it there for you, reporting on the issues that we see.

@cuiqin
Collaborator

cuiqin commented Jan 18, 2017

Adaptive switch-over on RandomPip is working.

@eichelbe
Collaborator Author

As a follow-up: the LUH infrastructure is now working with coordination.commandCompletion.onEvent = true. This will become active with the next restart.

@eichelbe
Collaborator Author

For (4)+(5), the financial data source may need an adjustment to read from the file system rather than HDFS (not available in the LUH cluster). Either there is a way to convince Miroslav, or we have to modify the TSI source. Initial discussions with Christoph regarding the Twitter source (works with the file system) are ongoing to synchronize the work and the approach...

@ap0n
Collaborator

ap0n commented Jan 19, 2017

But couldn't the Okeanos HDFS also be used? (It will be slower because of the internet, but it could be an alternative, at least for testing...)

@L3SQualimaster

L3SQualimaster commented Jan 19, 2017 via email

@eichelbe
Collaborator Author

eichelbe commented Jan 19, 2017

Cool. From our meeting I had in mind that we do not have access to HDFS. Then let's keep things simple: put the data under /data/storm in HDFS and configure the infrastructure accordingly. I will tell TSI...

@ap0n
Collaborator

ap0n commented Jan 19, 2017

Just to let you know, today I spotted another attack on our HDFS; this time a data ransom was included! According to this, these attacks are quite widespread.

@eichelbe
Collaborator Author

Too bad. As the LUH cluster is behind an external server, the risk there should be lower.

@L3SQualimaster any news on the FocusPip?

@eichelbe
Collaborator Author

The infrastructure setup has been changed to use the LUH HDFS with base path /data/storm, as suggested by Miroslav (thanks again). The respective setup information is passed to the workers by the infrastructure via the DML configuration class (getHdfsUrl(), getHdfsPath() as a postfix to the HDFS URL, getSimulationLocalPath(), useSimulationHdfs(), as well as getDfsPath()). An infrastructure update and a restart of the infrastructure are needed.

It's now up to the sources to take up this information and to copy the required data into the HDFS. What is the time plan there?
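A minimal sketch of how a source could take up these settings, assuming the accessors above are static methods on the DML configuration class (DataManagementConfiguration) and assuming a plain Hadoop copy for staging; the signatures and the copy step are assumptions, not the actual implementation:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import eu.qualimaster.dataManagement.DataManagementConfiguration;

public class SimulationDataStager {

    // Returns the path a source shall read from; copies the local file into
    // HDFS (under the configured base path, e.g. /data/storm) if HDFS is enabled.
    public static String stage(String fileName) throws Exception {
        if (DataManagementConfiguration.useSimulationHdfs()) { // accessor from above
            FileSystem fs = FileSystem.get(
                URI.create(DataManagementConfiguration.getHdfsUrl()),
                new Configuration());
            Path target = new Path(DataManagementConfiguration.getHdfsPath(), fileName);
            fs.copyFromLocalFile(
                new Path(DataManagementConfiguration.getSimulationLocalPath(), fileName),
                target);
            return target.toString();
        }
        // local file system fallback, as discussed for the LUH cluster
        return DataManagementConfiguration.getSimulationLocalPath() + "/" + fileName;
    }
}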

@ap0n
Collaborator

ap0n commented Jan 23, 2017

The SpringClientSimulator is modified to use these settings (committed about half an hour ago).

@eichelbe
Collaborator Author

Fine :)

@eichelbe
Collaborator Author

BTW, the HDFS user/group setup is not done, so the infrastructure may complain about that while extracting the setup files for Stefan. But as far as I know, Tuan is trying to find a way for the stakeholder applications to access the pipeline results; then we can also figure out whether extracting the setup files is needed in the LUH setup.

@eichelbe
Collaborator Author

... the installation needs a further classpath entry for HDFS. Discussing with Miroslav...

@ChristophHubeL3S

ChristophHubeL3S commented Jan 24, 2017

Hi everyone, I got an error while trying to start the infrastructure on hadoop2:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:97)
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:82)
at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.clearFolder(HdfsUtils.java:232)
at eu.qualimaster.coordination.RepositoryConnector.readModels(RepositoryConnector.java:537)
at eu.qualimaster.coordination.RepositoryConnector.initialize(RepositoryConnector.java:441)
at eu.qualimaster.coordination.RepositoryConnector.(RepositoryConnector.java:330)
at eu.qualimaster.coordination.CoordinationManager.start(CoordinationManager.java:417)
at eu.qualimaster.adaptation.platform.Main.startupPlatform(Main.java:72)
at eu.qualimaster.adaptation.platform.Main.main(Main.java:117)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more

(I tried to run ./main.sh)

@eichelbe
Collaborator Author

That's exactly what my last post was about; for this, I need a change done by root. And please do not use main.sh on your cluster, as the infrastructure is installed there as a service. I'll let you know as soon as the classpath entry is clarified with Miroslav (or you may call him directly ;))

@eichelbe
Collaborator Author

eichelbe commented Jan 24, 2017

Ok, fixed thanks to Miroslav. Updated the Wiki in this respect (main.sh).

Started PriorityPip. It comes up and reports a java.lang.NullPointerException at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49), called from eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTuple.

Here is the full trace
node14: java.lang.NullPointerException: null
node14: at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49) ~[stormjar.jar:na]
node14: at eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTuple(PriorityPip_Source0Source.java:148) ~[stormjar.jar:na]
node14: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node14: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node14: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node14: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]

hdfs.path and hdfs.url are set up. The simulation settings as discussed in #59 are not set so far. Are they needed?
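For reference, a sketch of how these settings could be inspected. This assumes they live as plain key/value pairs in the qm.infrastructure.cfg file mentioned later in this thread; that file format is an assumption, and this is not the infrastructure's actual loader:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class CfgPeek {
    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        // path as reported in the infrastructure logs further below
        try (FileReader in = new FileReader("/var/nfs/qm/qm.infrastructure.cfg")) {
            cfg.load(in);
        }
        System.out.println("hdfs.url  = " + cfg.getProperty("hdfs.url"));
        System.out.println("hdfs.path = " + cfg.getProperty("hdfs.path"));
    }
}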

@ChristophHubeL3S

Thanks for fixing! The FocusPip starts now, but immediately crashes because of the FinancialSource, which is not a surprise given that the financial data is not yet reachable.

Btw, I would not use the PriorityPip for testing since it is very old and I think nobody is maintaining it.

@eichelbe
Collaborator Author

Ok (see trace above). How do we set up the financial simulation source? @ap0n

@ChristophHubeL3S

Apostolos: we can add the financial dataset here: /local/home/storm/datasets

Then we just have to adapt the Source.

@antoine-tran

FYI, we just killed the TransferPip pipeline.

@antoine-tran

Btw, how can we see the status of a pipeline's lifecycle? :)

@eichelbe
Collaborator Author

Just through the infrastructure logs :o

@eichelbe
Collaborator Author

We did not tweak the Storm UI... Alternatively, through QM-IConf (adaptation/runtime log), but for that we also need to have the adaptation port available ;)

@eichelbe
Collaborator Author

... As long as the infrastructure logs "Elevating to"..., the respective phase is not reached. For both critical phases (CREATED and INITIALIZED), it also logs which nodes are expected and which are missing.

CREATED means that all nodes have sent an initial event via the event bus and are ready to receive the initial algorithm assignment. If exceptions happen here, reaching CREATED can easily be prevented. The adaptation then issues commands/signals to set the initial algorithms.

INITIALIZED means that all nodes have received and confirmed their initial algorithm and are ready for processing.

If INITIALIZED is reached, the infrastructure connects the data sinks and sources and processing starts, i.e., it switches the pipeline to STARTED.
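As a compact summary, the phases described above form a simple progression (names taken from the text; this enum is purely illustrative, not the infrastructure's actual type):

public enum PipelinePhase {
    CREATED,      // all nodes sent their initial event via the event bus and can
                  // receive the initial algorithm assignment
    INITIALIZED,  // all nodes received and confirmed their initial algorithm
    STARTED       // sources/sinks connected, processing is running
}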

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Hi, what about the running FocusPip? I would like to test the load shedding (just enabled on PriorityFinancialPip) and need an infrastructure restart (local model)...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

And from the FocusPip:

node19: 2017-02-01T12:40:26.879+0100 b.s.util [ERROR] Async loop died! (PipelineVar_7_Source1)
node19: java.lang.NullPointerException: null
node19: at eu.qualimaster.focus.FocusedSpringClientSimulator.getSpringStream(FocusedSpringClientSimulator.java:63) ~[stormjar.jar:na]
node19: at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:143) ~[stormjar.jar:na]
node19: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
node19: 2017-02-01T12:40:26.879+0100 b.s.d.executor [ERROR]
node19: java.lang.NullPointerException: null
node19: at eu.qualimaster.focus.FocusedSpringClientSimulator.getSpringStream(FocusedSpringClientSimulator.java:63) ~[stormjar.jar:na]
node19: at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:143) ~[stormjar.jar:na]
node19: at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
node19: 2017-02-01T12:40:26.915+0100 b.s.util [ERROR] Halting process: ("Worker with id " "c1c4a871-f049-400f-9004-52e23454ec5e" " died")
node19: java.lang.RuntimeException: ("Worker with id " "c1c4a871-f049-400f-9004-52e23454ec5e" " died")
node19: at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:335) [storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.RestFn.invoke(RestFn.java:460) [clojure-1.5.1.jar:na]
node19: at backtype.storm.daemon.worker$fn__6460$fn__6461.invoke(worker.clj:669) [storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.daemon.executor$mk_executor_data$fn__5726$fn__5727.invoke(executor.clj:261) [storm-core-0.9.5.jar:0.9.5]
node19: at backtype.storm.util$async_loop$fn__565.invoke(util.clj:488) [storm-core-0.9.5.jar:0.9.5]
node19: at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node19: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]

This happens after some "Got playerList parameter. New value: addmarketPlayer/0,1,2,3,4,5..." messages.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Killing the FocusPip...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Changing financial data set to overload scenario 2...

@antoine-tran

@eichelbe I'm testing the HBase connection and needed to change and recompile the DataManagementLayer and manually copy it into the provided_libs directory on the LUH cluster.

Do I have to re-run main.sh to reload the library? It seems the code was not updated when I restarted the pipelines...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

It is definitely not updated via the pipelines, as it is not included in them (it is a provided library). As stated above, running main.sh is not the right way on your cluster, as its execution is controlled by a service manager.

Ok, Jenkins is done. I'll do the update and the distribution and let you know...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

What about the running TransferPip?

@antoine-tran

No, I didn't commit it yet. I just thought we could hot-test it quickly via manual copies.
You can delete the running pipeline.

@antoine-tran

Okay, just committed it now.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Then it will take a while until we can update the infrastructure; I did not know this. Otherwise, a patch and overriding the jar after a Maven build would have worked...

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Jenkins is ready. Distributing libraries and restarting infrastructure on LUH cluster.

@eichelbe
Collaborator Author

eichelbe commented Feb 1, 2017

Model loaded, infrastructure is up.

@antoine-tran

So, I hard-coded the HBase configuration instead of getting the values from DataManagementConfiguration, just to see whether the problem is about ZooKeeper or not. The good news is that we didn't get the "Connection refused" error, so as long as qm.infrastructure.cfg manages to be visible to the pipeline, I guess we are good to go.

For the moment, we have the following error:

2017-02-01T18:55:59.683+0100 e.q.Configuration [ERROR] While reading configuration file /var/nfs/qm/qm.infrastructure.cfg: /var/nfs/qm/qm.infrastructure.cfg (No such file or directory)

Not sure how this file is distributed to the infrastructure ...
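For context, hard-coding the HBase/ZooKeeper settings typically looks like the sketch below; the host and port are placeholders, not the cluster's actual values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HardcodedHBase {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host.example");   // placeholder host
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // default ZooKeeper port
        return conf;
    }
}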

@eichelbe
Collaborator Author

eichelbe commented Feb 2, 2017

Which component is emitting that error?

@antoine-tran

I actually don't know which component is causing this; my guess is somewhere in the EventManager, which tries to locate qm.infrastructure.cfg (I'm not deeply familiar with this part, actually).

Here is an excerpt from the log when I ran FocusPip (so no ReplayMechanism around):

2017-02-02T10:55:51.916+0100 e.q.F.t.PipelineVar_7_Source1Source [INFO] Prepared--basesignalspout.... FocusPip/PipelineVar_7_Source1
2017-02-02T10:55:51.924+0100 e.q.e.EventManager [INFO] received
2017-02-02T10:55:51.925+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.infrastructure.PipelineLifecycleEvent
2017-02-02T10:55:51.926+0100 e.q.e.EventManager [INFO] sending
2017-02-02T10:55:51.947+0100 e.q.e.EventManager [INFO] sending ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.infrastructure.PipelineLifecycleEvent
2017-02-02T10:55:51.958+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ReferenceDataManagementEvent
2017-02-02T10:55:51.959+0100 e.q.e.EventManager [INFO] received ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ShutdownEvent
2017-02-02T10:55:51.968+0100 e.q.Configuration [ERROR] While reading configuration file /var/nfs/qm/qm.infrastructure.cfg: /var/nfs/qm/qm.infrastructure.cfg (No such file or directory)
2017-02-02T10:55:51.971+0100 e.q.e.EventManager [INFO] sending ForwardHandlerEvent clientId: 9be758a0d32e06b5:41991db3:159fe4064ed:-8000-2601486293531395 eventClass: eu.qualimaster.dataManagement.events.ReferenceDataManagementEvent
2017-02-02T10:55:51.989+0100 e.q.a.i.c.SpringClientSimulator [INFO] Using local FS for simulation data
2017-02-02T10:55:51.990+0100 e.q.a.i.c.SpringClientSimulator [INFO] Path to Symbollist.txt: /home/storm/data//Symbollist.txt
2017-02-02T10:55:51.992+0100 e.q.a.i.c.SpringClientSimulator [INFO] Path to data.txt: /home/storm/data//data.txt

@eichelbe
Collaborator Author

eichelbe commented Feb 2, 2017

The bolts and spouts initialize the event manager as needed, and this happens, based on Storm conf information, before the shown trace; they do not rely on that path. So something else may be calling that; as far as I remember, some TSI components do this with a fixed path as a fallback, but this is just a guess. Of course, we could emit the stack trace to find the cause, but this may be as confusing as the sinks indicating with a stack trace that the external service is not reachable.

More importantly: do you get the correct HBase setup information from the DML configuration without trying to initialize it manually, and when do you try to get this information? Please note that the correct information is available earliest after the super.prepare(...) call, not in any constructor (which is the wrong place for heavy-weight initialization anyway).
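In Storm terms, that means reading the configuration no earlier than prepare/open. A sketch, where the base class name and what gets initialized are assumptions; only the super.prepare(...) timing is the point:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;

public class HBaseWritingBolt extends BaseSignalBolt { // base class assumed

    private transient org.apache.hadoop.conf.Configuration hbaseConf;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        super.prepare(conf, context, collector); // DML configuration is reliable from here on
        // safe place for heavy-weight initialization, e.g. the HBase connection
        hbaseConf = org.apache.hadoop.hbase.HBaseConfiguration.create();
    }
}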

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

I will do some re-parallelization tests on the LUH cluster now. This may cause strange effects on the correlation computation in other pipelines. It's probably better not to run further tests at the moment, probably until lunch. I'll let you know.

@antoine-tran

I see that the cluster is free now. Could we run some tests?

In addition, we are constantly making changes in the DML now to spot the issues (just commenting/un-commenting some pieces of code to inspect). How can we restart the platform conveniently instead of emailing around?

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

Partially. Try running your tests now; the effects should now be rather local to PrioFinPip.

If you take the responsibility for fixing problems from an update until the infrastructure is running again, please feel free to read the wiki and use the respective scripts.

@eichelbe
Collaborator Author

eichelbe commented Feb 3, 2017

Is the TransferPip still needed? Can I go on?

@antoine-tran

Hi Holger, yes, you can kill the pipelines and proceed now. We are investigating locally now.

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Hi, is anyone using the LUH cluster right now? I would like to update/restart it and give the TimeTravelPip another try (I just have an idea where the strange exception could come from).

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Update/restart done...

@antoine-tran

@eichelbe Hi Holger, could you tell us where to see the infrastructure log of the currently running TransferPip?
Also, how can one restart the infrastructure using the same account (storm) on hadoop2?

@eichelbe
Collaborator Author

eichelbe commented Feb 6, 2017

Hi Tuan,
there is only a single infrastructure log (log-rotated) at /var/log/storm/qm-infra/current. If required, the commands for restarting the infrastructure are QM_INFRA_DOWN and then QM_INFRA_UP (see Wiki). Please observe the infrastructure log until the infrastructure is completely started!
