Simulating partially synchronous n/w #1026
Conversation
Force-pushed from fe856fb to a1747d7 (Compare)
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit a1747d7.
*** 0.03333333333333333% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Force-pushed from 4120d32 to d11f306 (Compare)
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit d11f306.
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
.get();
log.info("Healing nodes successful: {}", pollReport);
} catch (InterruptedException | ExecutionException e) {
log.error("Healing nodes failed: ", e);
Rethrow the interrupted exception
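A minimal sketch of the suggested fix, with a hypothetical awaitHealing helper standing in for the PR's get() call: restore the interrupt flag and rethrow instead of only logging the exception.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class HealingAwait {
    // Hypothetical helper illustrating the review suggestion: do not
    // swallow InterruptedException; restore the interrupt flag and rethrow.
    static <T> T awaitHealing(CompletableFuture<T> healingFuture) {
        try {
            return healingFuture.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status for callers
            throw new IllegalStateException("Healing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Healing nodes failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitHealing(CompletableFuture.completedFuture("pollReport")));
    }
}
```

Restoring the interrupt status keeps the management thread's shutdown path responsive instead of silently eating the signal.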
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 1267fb5.
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
This looks okay stylistically.
However, it's hard to determine what the overall design is. What is partially synchronous? Could you describe what kind of failures we can detect (and which ones we can't)? Can you also describe what conditions we can heal from? You can add this to the first comment (PR description).
@no2chem I have added some explanation in the PR description. More detailed explanation is present in the javadocs. Let me know if this is still unclear.
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit 653298c.
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
Force-pushed from a1c7b05 to c5354f8 (Compare)
Results automatically generated by CorfuDB Benchmark Framework to assess the performance of this pull request for commit c5354f8.
*** 0.0% transaction FAILURE rate for NonConflictingTx+Scan workload, 1 threads, Disk mode
An interactive dashboard with Pull Request Performance Metrics for ALL cluster types and numbers of threads in run, is available at:
LGTM
/**
* Executes the detector which runs the failure or healing detecting algorithm.
* Gets the polling report from the execution of the detector.
r/polling/poll
Done.
import lombok.Data;

/**
* Poll Report generated by the polling detectors. |
Poll report generated by detectors that poll to detect failed or healed nodes.
Done
@@ -5,7 +5,7 @@
import lombok.extern.slf4j.Slf4j;

import org.corfudb.runtime.CorfuRuntime;
import org.corfudb.runtime.view.IFailureHandlerPolicy;
import org.corfudb.runtime.view.IReconfigurationHandlerPolicy;
r/or policy detecting a failure in the cluster/or policy detecting a failure or healing in the cluster
r/Handle healing: Handles healing of responsive nodes./Handle healing: Handles healing of unresponsive nodes.
r/Handle healing: Handles healing of responsive nodes./Handle healing: Handles healing of unresponsive nodes.
We actually heal responsive nodes. Shouldn't it be the way it is?
I can change it to:
Handle healing: Handles healing of unresponsive marked nodes which are now responsive.
yes that is better.
*
* <p>Created by zlokhandwala on 11/21/16.
*/
public class NoLogUnitHealingPolicy implements IReconfigurationHandlerPolicy {
I do not understand this name. It is not intuitive.
Done.
* @param currentLayout The current layout
* @param healedServers Set of healed server addresses
*/
public void handleHealing(IReconfigurationHandlerPolicy failureHandlerPolicy,
param name should be healingHandlerPolicy
Changed.
} else {
if (newPeriod != period && newPeriod != 0) {
period = newPeriod;
tuneRoutersResponseTimeout(membersSet, period);
This method should not have a side effect like tuneRoutersResponseTimeout.
Removed side effect
* - Poll result aggregation.
* - If we complete an iteration without detecting any healed nodes, we end the round.
* - Else we continue polling and generate the report with the healed node.
* The management server ensures only one instance of this class and hence this is NOT thread safe.
I like the not-thread-safe disclaimer, but I am still not a big fan of object-scoped variables unless they really make sense.
try {
pollCompletableFutures[i].get();
responses[i] = pollIteration;
} catch (Exception e) {
Why not just catch WrongEpochException before the generic Exception.
WrongEpochException is wrapped inside the ExecutionException, so it cannot be caught in a separate catch block. This classification is required.
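A sketch of the classification being described, with a hypothetical WrongEpochException stand-in: the real exception surfaces as the cause of the ExecutionException, so it is unwrapped with getCause() rather than caught in its own catch block.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class PollClassify {
    // Hypothetical stand-in for Corfu's WrongEpochException.
    static class WrongEpochException extends RuntimeException { }

    // Classify one completed poll future. WrongEpochException is thrown on
    // the remote call, so future.get() delivers it wrapped in an
    // ExecutionException; it must be unwrapped via getCause().
    static String classify(CompletableFuture<Void> poll) {
        try {
            poll.get();
            return "responsive";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return (e.getCause() instanceof WrongEpochException)
                    ? "wrong-epoch" : "failed";
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> poll = new CompletableFuture<>();
        poll.completeExceptionally(new WrongEpochException());
        System.out.println(classify(poll)); // prints "wrong-epoch"
    }
}
```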
try {
pollCompletableFutures[i].get();
responses[i] = pollIteration;
} catch (Exception e) {
Log errors for exceptions everywhere needed. It is hard to remember, but we need to make sure we have enough logging.
I agree, but logging exceptions in this particular loop is really going to produce a lot of unwanted logs.
Since this is a poller for healing nodes, it usually polls failed nodes and will encounter exceptions every second. We do not want this to pollute our debug logs.
For now, I will log when we encounter a WrongEpochException.
}

try {
Thread.sleep(interIterationInterval); |
Why do you have to do this instead of using a scheduled executor to start the next task after a certain delay?
This sleep and loop is executed only if there is a failure.
In an ideal scenario, each round consists of only one iteration as there are no failures.
If there are failures, the round confirms this failure by iterating thrice.
So, from my view, this reduces a bit of code complexity. Let me know if you feel otherwise.
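For comparison, the reviewer's alternative can be sketched with a ScheduledExecutorService (the iteration body and interval here are hypothetical placeholders, not the PR's code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PollingScheduler {
    // Sketch: let a scheduled executor pace the poll iterations instead of
    // a Thread.sleep inside a loop.
    static int runIterations(int iterations, long intervalMillis) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(iterations);
        // scheduleWithFixedDelay waits intervalMillis between the end of one
        // iteration and the start of the next, mirroring the sleep-based loop.
        scheduler.scheduleWithFixedDelay(remaining::countDown, 0, intervalMillis, TimeUnit.MILLISECONDS);
        remaining.await();
        scheduler.shutdownNow();
        return iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed " + runIterations(3, 10) + " iterations");
    }
}
```

Either approach works; the executor version avoids blocking the detector thread in a sleep, at the cost of a little more machinery for a loop that, as noted above, only runs multiple iterations when a failure is suspected.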
Force-pushed from 534a051 to 3b4b075 (Compare)
@@ -128,6 +128,10 @@ public synchronized void handleMessageLayoutRequest(CorfuPayloadMsg<Long> msg,
}
}

public String le() {
What is this? It probably needs to be cleaned up.
Removed.
import org.corfudb.runtime.view.QuorumFuturesFactory;

@Slf4j
public class ManagementAgent { |
comments ?
Added
private final String bootstrapEndpoint;

@Getter
private volatile CompletableFuture<Boolean> sequencerBootstrappedFuture; |
comments
Added.
private volatile CompletableFuture<Boolean> sequencerBootstrappedFuture;

public ManagementAgent(Callable<CorfuRuntime> getRuntime, ServerContext serverContext) { |
This needs comments.
Done.
}

serverContext.installSingleNodeLayoutIfAbsent();
serverContext.saveManagementLayout(serverContext.getCurrentLayout()); |
You are saving the management layout twice?
Please put comments as this is very dense.
Added comments.
* - This task is executed in intervals of 1 second (default). This task is blocked until
* the management server is bootstrapped and has a connected runtime.
* - On every invocation, this task refreshes the runtime to fetch the latest layout and also
* updates the local copy of the 'latestLayout' |
No more latestLayout
Changed.
log.info("Initiated Failure Handler.");
log.info("handleFailureDetectedMsg: Received DetectorMsg : {}", msg.getPayload());

DetectorMsg detectorMsg = msg.getPayload(); |
Update comments for this new logic. It is better.
Done.
@@ -218,6 +232,7 @@ public String getNodeIdBase64() {
public synchronized boolean installSingleNodeLayoutIfAbsent() {
if ((Boolean) getServerConfig().get("--single") && getCurrentLayout() == null) {
setCurrentLayout(getNewSingleNodeLayout());
log.error("HERE ====="); |
remove
Done.
*
* <p>Created by zlokhandwala on 11/21/16.
*/
public class LayoutSequencerHealingPolicy implements IReconfigurationHandlerPolicy { |
rename it to SequencerHealingPolicy as it only heals Sequencers.
Done.
.initiateFailureHandler().get();
corfuRuntime.getRouter(SERVERS.ENDPOINT_2).getClient(ManagementClient.class)
.initiateFailureHandler().get();
getManagementServer(SERVERS.PORT_0).getManagementAgent().getCorfuRuntime(), |
change this test to first fail and then heal the node.
Done.
Please see the main set of comments in the HealingDetector class and follow that set of comments. In general the code is way too verbose for what it does.
@@ -153,6 +153,10 @@ public SequencerServer(ServerContext serverContext) {
globalLogTail.set(initialToken);
}

if ((Boolean) opts.get("--single")) {
readyStateEpoch = serverContext.getNewSingleNodeLayout().getEpoch(); |
This isn't right, this generates a new layout which will always have epoch 0. You can use installSingleNodeLayoutIfAbsent, followed by getting the actual layout.
How do you get the actual layout ?
You cannot bootstrap the node again as it will throw an Already bootstrapped exception.
If there is a reconfiguration change and the cluster moves to a new epoch, then the sequencer becomes NOT_READY as this epoch is now stale anyways. The management server then takes care of bootstrapping the new primary sequencer for the new epoch.
installSingleNodeLayoutIfAbsent does not bootstrap the node. It installs a layout if it is not present. ManagementServer/LayoutServer constructors call the same function. Then just get the layout from the data store.
I'm not sure how an AlreadyBootstrapped exception would be thrown; that isn't even thrown anywhere in ServerContext.
So I'm just saying that once we start a server as a single node (-s) we cannot bootstrap the node again later. As you said, the LayoutServer installs the layout from the local data store.
Done.
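The agreed fix can be sketched with hypothetical minimal stand-ins for ServerContext and Layout (the real classes live in the Corfu server): install the single-node layout only if absent, then read the persisted layout's epoch, instead of generating a fresh layout that would always report epoch 0.

```java
public class SingleNodeBootstrap {
    // Hypothetical stand-in for org.corfudb.runtime.view.Layout.
    static class Layout {
        private final long epoch;
        Layout(long epoch) { this.epoch = epoch; }
        long getEpoch() { return epoch; }
    }

    // Hypothetical stand-in for ServerContext backed by a data store.
    static class ServerContext {
        private Layout currentLayout;

        ServerContext(Layout persistedLayout) { this.currentLayout = persistedLayout; }

        // Installs a fresh single-node layout (epoch 0) only when no layout
        // has been persisted yet; an existing layout is left untouched.
        synchronized void installSingleNodeLayoutIfAbsent() {
            if (currentLayout == null) {
                currentLayout = new Layout(0);
            }
        }

        Layout getCurrentLayout() { return currentLayout; }
    }

    // The suggested pattern: install-if-absent, then read the actual layout,
    // so a node restarted at a later epoch does not report epoch 0.
    static long readyStateEpoch(ServerContext serverContext) {
        serverContext.installSingleNodeLayoutIfAbsent();
        return serverContext.getCurrentLayout().getEpoch();
    }
}
```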
public synchronized void saveManagementLayout(Layout layout) {
// Cannot update with a null layout.
if (layout == null) {
log.warn("saveManagementLayout: Attempted to update with null layout"); |
Throw an exception.
@@ -368,16 +384,32 @@ public void setStartingAddress(long startingAddress) {
*
* @param layout Layout to be persisted
*/
public void setManagementLayout(Layout layout) {
dataStore.put(Layout.class, PREFIX_MANAGEMENT, MANAGEMENT_LAYOUT, layout);
public synchronized void saveManagementLayout(Layout layout) { |
Needs to be annotated with @nonnull
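Both review comments combined can be sketched as follows (the Layout stand-in is hypothetical; the real signature would also carry the javax.annotation.Nonnull annotation): fail fast on a null layout instead of logging a warning and silently returning.

```java
import java.util.Objects;

public class LayoutStore {
    // Hypothetical stand-in for the persisted layout type.
    static class Layout { }

    private Layout managementLayout;

    // Sketch of the suggested behavior: reject null eagerly with
    // Objects.requireNonNull rather than log.warn-and-return.
    public synchronized void saveManagementLayout(Layout layout) {
        this.managementLayout = Objects.requireNonNull(
                layout, "saveManagementLayout: layout must not be null");
    }
}
```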
* Created by zlokhandwala on 11/29/17.
*/
@Slf4j
public class FailureDetector implements IDetector { |
Definitely we should NOT be unit testing private internal methods.
/**
* Members to poll in every round
*/
private String[] members; |
This should be a list.
Done.
@@ -658,8 +663,9 @@ private void checkClusterId(@Nonnull Layout layout) {
// We haven't adopted a clusterId yet.
if (clusterId == null) {
clusterId = layout.getClusterId();
log.info("Connected to new cluster {}", clusterId == null ? "(legacy)" :
UuidUtils.asBase64(clusterId));
if (clusterId != null) { |
Whats the reason for this change?
if clusterId is null, the logs are filled with this log message.
This is because the management service invalidates the layout every second.
2018-01-16 18:35:46,776 INFO [ForkJoinPool.commonPool-worker-2] o.c.r.CorfuRuntime - Connected to new cluster (legacy)
2018-01-16 18:35:47,778 INFO [ForkJoinPool.commonPool-worker-2] o.c.r.CorfuRuntime - Connected to new cluster (legacy)
2018-01-16 18:35:48,778 INFO [ForkJoinPool.commonPool-worker-2] o.c.r.CorfuRuntime - Connected to new cluster (legacy)
(... the same line repeated once per second ...)
Is there any other alternative? Should I keep this and reduce it to TRACE? It would still produce a lot of logs.
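The resulting guard can be sketched as follows (UUID stands in for the cluster id; the class and method names here are hypothetical, not the CorfuRuntime API): adopt the layout's clusterId on first sight, but only emit the "connected" log line when the id is non-null, so the per-second layout refresh of a legacy (null-id) cluster stays quiet.

```java
import java.util.UUID;

public class ClusterIdLogGuard {
    private UUID clusterId;

    // Returns the log line to emit, or null when nothing should be logged.
    String checkClusterId(UUID layoutClusterId) {
        if (clusterId == null) {
            clusterId = layoutClusterId;
            if (clusterId != null) {
                return "Connected to new cluster " + clusterId;
            }
        }
        return null; // legacy cluster or already-adopted id: stay quiet
    }
}
```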
/**
* Retries to connect to a disconnected node.
*/
@Default int connectionRetries = 3; |
Doesn't seem like it should be part of the runtime.
@@ -361,6 +362,9 @@ private synchronized void connectChannel(Bootstrap b)
log.warn("Exception while reconnecting, retry in {} ms", timeoutRetry, e);
Sleep.MILLISECONDS.sleepUninterruptibly(timeoutRetry);
}
if (--retryCount == 0) {
throw new NetworkException("Connection retry limit reached.", node); |
This is not a good idea. No one will see this exception since it runs on a Netty future.
This code was throwing an un-handled exception earlier after a certain number of retries.
2018-01-17 15:49:49,899 WARN [client-1] i.n.u.c.DefaultPromise - An exception was thrown by org.corfudb.runtime.clients.NettyClientRouter$$Lambda$123/1147741970.operationComplete()
org.corfudb.runtime.exceptions.NetworkException: Retry limit reached. [endpoint=tcp://localhost:9001/]
at org.corfudb.runtime.clients.NettyClientRouter.lambda$connectChannel$2(NettyClientRouter.java:359)
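One way to address the reviewer's concern, sketched with a hypothetical connectWithRetries helper (the attempt hook stands in for connectChannel): an exception thrown inside a Netty future listener is only seen by the event loop, so complete a caller-visible future exceptionally instead of throwing in the listener.

```java
import java.util.concurrent.CompletableFuture;

public class ConnectRetry {
    // Sketch: surface retry exhaustion through a future the caller actually
    // waits on, rather than throwing on a callback thread where the
    // exception is swallowed by the executor.
    static CompletableFuture<Void> connectWithRetries(int retries, Runnable attempt) {
        CompletableFuture<Void> connected = new CompletableFuture<>();
        for (int i = 0; i < retries; i++) {
            try {
                attempt.run();
                connected.complete(null);
                return connected;
            } catch (RuntimeException e) {
                // connection attempt failed; fall through and retry
            }
        }
        connected.completeExceptionally(
                new RuntimeException("Connection retry limit reached."));
        return connected;
    }
}
```

A caller that does connected.get() (or joins it) then observes the retry-limit failure directly instead of finding it only in the DefaultPromise warning above.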
* @throws LayoutModificationException Thrown if attempt to create an invalid layout.
*/
@Override
public Layout generateLayout(Layout originalLayout, |
javax annotations, please
import org.corfudb.runtime.view.Layout;
import org.corfudb.util.Sleep;
import org.corfudb.util.Utils;
Please see comments on healing detector. This logic can be summarized in a few lines and without state information.
Done.
Force-pushed from bb38010 to b952852 (Compare)
@no2chem Addressed all comments as we discussed.
Ok, looks much better to me.
Can you rebase please, let's get this in.
Force-pushed from f65ec1d to e143e94 (Compare)
* Simulating partially synchronous n/w
* Refactored detector code.
Failure detector simulating a partially synchronous network.
Added a healing detector which polls and detects healed nodes.
Partially Synchronous Network.
A completely synchronous network would guarantee message delivery within a bounded time frame. This would be ideal for detecting failures: if we fail to receive a response within the determined period, the node has failed. Since this is not practical, we can have delayed responses from slow servers, and it would not be right to mark a slow node as failed just because its responses are delayed.
New Fault Detector.
The new fault detector handles this in the following manner.
Healing Detector.