daemon doesn't work in remote machine #157

Closed
soloman817 opened this issue Feb 22, 2016 · 2 comments

Comments

@soloman817

Hi,

I did the following:

  1. I built Prajna from the master branch source code with build.cmd R.
  2. I copied the client folder to two machines.
  3. I deleted the folder C:\Prajna on both machines.
  4. I turned off the Windows Firewall completely on both machines.
  5. I started the client without any options, so it uses the default port 1082.

Then I simply want to run the following remotely:

        private static void SayHello(Cluster cluster)
        {
            var dset = new DSet<int> { Name = Guid.NewGuid().ToString("D"), Cluster = cluster };
            var descriptions =
                dset
                .Distribute(Enumerable.Range(0, cluster.NumNodes))
                .Select(i =>
                {
                    var gpuId = Int32.Parse(ConfigurationManager.AppSettings["GpuId"]);
                    var machineName = System.Environment.MachineName;
                    var process = System.Diagnostics.Process.GetCurrentProcess();
                    var gpu = Gpu.Get(gpuId);
                    return $"Hello from {machineName} {gpu} taskId={i} processId={process.Id} threadId={Thread.CurrentThread.ManagedThreadId}";
                })
                .ToIEnumerable()
                .ToArray();
            foreach (var description in descriptions)
            {
                Console.WriteLine(description);
            }
        }

The test results are as follows.

If I use the following cluster.lst, it WORKS:

XiangCluster,1082
localhost,1082

It also works if I use the real IP address (I launch the application from the same machine):

XiangCluster,1082
192.168.1.110,1082

However, if I add a remote machine, like this:

XiangCluster,1082
192.168.1.110,1082
192.168.1.108,1082

Then it DOESN'T WORK anymore.

I checked the daemon log on 192.168.1.110 and found the following:

============== New Log File ======================= 
160222_020627.133310,1,Info,PrajnaMachineId is 290efbd143477d11
160222_020627.173490,1,Info,Initialize network stack with initial buffers: 128 max buffers: 33554 buffer size: 128000 network threads: 2
160222_020627.215722,1,Info,Start PrajnaClient at port 1082 (1100-1150)...................... Mode x64, 1 MB
160222_020627.218012,1,Info,Minimum threads: 16, Minimum I/O completion threads: 4
160222_020627.218622,1,Info,Maximum threads: 32767, Maximum I/O completion threads: 1000
160222_020627.219319,1,Info,Available threads: 32767, Available I/O completion threads: 1000
160222_020627.220786,1,Info,Start Parameters [||]
160222_020627.228628,1,Info,All command parsed ==== true
160222_020627.261606,1,Info,Authentication parameters: pwd=empty keyfile= keyfilepwd=empty
160222_020709.983452,18,Info,GetDriveSpace, fail to retrieve remote storage information for machine 192.168.1.108, with exception System.Management.ManagementException: Access denied 
   at System.Management.ManagementException.ThrowWithExtendedInfo(ManagementStatus errorCode)
   at System.Management.ManagementScope.InitializeGuts(Object o)
   at System.Management.ManagementScope.Initialize()
   at System.Management.ManagementObjectSearcher.Initialize()
   at System.Management.ManagementObjectSearcher.Get()
   at Prajna.Core.RemoteConfig.GetDriveSpace(String machineName)
160222_020744.693316,16,Error,Prajna.Core.Task.ErrorInSeparateApp : (Close,Job) Failed to find Job Action object for Job a6dfc439-1db5-41f5-9843-569a50737867, error has happened before? 

BTW, when I use Prajna from the NuGet package, it works.

@soloman817
Author

Update: I managed to create a domain-joined network. I added a Windows Server 2012 R2 machine as the domain controller and ran the test again, but this time I got a different error:

Prajna init...
Prajna init done.
Cluster.NumNodes = 2

Unhandled Exception: System.AggregateException: One or more errors occurred. ---> System.Runtime.Remoting.RemotingException: ParseTaskCommandAtDaemon:Membership list of peer 0 is larger than 0, this P2P path hasn't been implemented yet
   --- End of inner exception stack trace ---
   at Prajna.Core.SingleJobActionGeneric`1.get_IsCancelledAndThrow()
   at <StartupCode$Prajna>.$Job.TrySyncMetaDataHost@2478.Invoke(SingleJobActionApp jobAction)
   at Microsoft.FSharp.Core.Operators.Using[T,TResult](T resource, FSharpFunc`2 action)
   at Prajna.Core.Job.ReadyMetaData()
   at Prajna.Core.Job.Ready()
   at Prajna.Core.Job.PrepareRemoteExecutionRosterOnce(JobDependencies curJob, Job srcJob)
   at Prajna.Core.Job.PrepareRemoteExecutionRoster(JobDependencies curJob, Job srcJob)
   at Prajna.Core.Job.PrepareMetaData()
   at Prajna.Core.Job.ReadyMetaData()
   at Prajna.Core.Job.Ready()
   at Prajna.Core.DSetAction.RetrieveMetaData()
   at Prajna.Core.DSetAction.BaseBeginAction(TaskLaunchMode nLaunchNewTaskMode)
   at Prajna.Core.DSetTaskRead.RetrieveMetaData()
   at Prajna.Core.DSetEnumerator`1.DoMoveNext()
   at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
   at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
   at PrajnaTest.CS.PiEstimation.SayHello(Cluster cluster) in C:\Users\solom\Documents\Projects\AleaGPU\tools\PrajnaTest.CS\PiEstimation.cs:line 76
   at PrajnaTest.CS.PiEstimation.Main() in C:\Users\solom\Documents\Projects\AleaGPU\tools\PrajnaTest.CS\PiEstimation.cs:line 169

Any idea how to get it working on two machines over the network?

@soloman817
Author

This issue occurs because the cluster list file was changed without changing the cluster name. For more detail, please see here.
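
As a rough illustration only: assuming the first line of cluster.lst keeps the same name,port layout as the files shown earlier, a list that adds the remote machine and also gets a fresh cluster name might look like the sketch below. The name XiangCluster2 is a made-up placeholder, not something Prajna requires; any name that differs from the one used with the previous node list should serve the same purpose.

```text
XiangCluster2,1082
192.168.1.110,1082
192.168.1.108,1082
```

The point is simply that the name on the first line changes whenever the set of machines below it changes.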
