Testnet Deployment #326
Conversation
I've added a test file for the validation utils (strictly unit tests). Obviously for the more simple validators, it's probably not necessary, but there were a few cases with …
Force-pushed from `11dd3c8` to `4aacfb8`.
I've stupidly rebased, undoing the changes from this PR, and force-pushed, which caused the PR to close. Gonna go through reflog to try to undo and reopen the PR.
Force-pushed from `2c549df` to `3600f3d`.
While deploying the HTTP-Demo we learned that if you set an entrypoint, the command is appended to it as arguments. This means our task definition can keep the entrypoint fixed and vary only the command. Setting the entrypoint is preferable if our container has only 1 executable; in the case of PK, this is just the PK executable, and it makes `docker run ...` easier. The only reason to use the command alone would be a container with multiple executables to choose between. Note that Task Definitions also allow overriding the entrypoint and command per task.
Additionally we created a new security group just for testing. The deployed container requires several resources:
The end result, assuming everything is set up, is that the container is assigned a public IP address that we can access. This is a bit hard to find in the console. Each public IP here is provided by an ENI, which is automatically created by AWS in this case.
Because we have created the `socat-echo-server` task definition, here it is in full:

```json
{
"ipcMode": null,
"executionRoleArn": "arn:aws:iam::015248367786:role/ecsTaskExecutionRole",
"containerDefinitions": [
{
"dnsSearchDomains": null,
"environmentFiles": null,
"logConfiguration": {
"logDriver": "awslogs",
"secretOptions": null,
"options": {
"awslogs-group": "/ecs/socat-echo-server",
"awslogs-region": "ap-southeast-2",
"awslogs-stream-prefix": "ecs"
}
},
"entryPoint": [
"sh",
"-c"
],
"portMappings": [
{
"hostPort": 1314,
"protocol": "tcp",
"containerPort": 1314
},
{
"hostPort": 1315,
"protocol": "tcp",
"containerPort": 1315
},
{
"hostPort": 1316,
"protocol": "udp",
"containerPort": 1316
}
],
"command": [
"socat SYSTEM:\"echo hello; cat\" TCP4-LISTEN:1314`printf \"\\x2c\"`FORK & socat PIPE TCP4-LISTEN:1315`printf \"\\x2c\"`FORK & socat PIPE UDP4-LISTEN:1316`printf \"\\x2c\"`FORK"
],
"linuxParameters": null,
"cpu": 0,
"environment": [],
"resourceRequirements": null,
"ulimits": null,
"dnsServers": null,
"mountPoints": [],
"workingDirectory": null,
"secrets": null,
"dockerSecurityOptions": null,
"memory": null,
"memoryReservation": null,
"volumesFrom": [],
"stopTimeout": null,
"image": "alpine/socat:latest",
"startTimeout": null,
"firelensConfiguration": null,
"dependsOn": null,
"disableNetworking": null,
"interactive": null,
"healthCheck": null,
"essential": true,
"links": null,
"hostname": null,
"extraHosts": null,
"pseudoTerminal": null,
"user": null,
"readonlyRootFilesystem": null,
"dockerLabels": null,
"systemControls": null,
"privileged": null,
"name": "socat-echo-server"
}
],
"placementConstraints": [],
"memory": "512",
"taskRoleArn": "arn:aws:iam::015248367786:role/ecsTaskExecutionRole",
"compatibilities": [
"EC2",
"FARGATE"
],
"taskDefinitionArn": "arn:aws:ecs:ap-southeast-2:015248367786:task-definition/socat-echo-server:6",
"family": "socat-echo-server",
"requiresAttributes": [
{
"targetId": null,
"targetType": null,
"value": null,
"name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
},
{
"targetId": null,
"targetType": null,
"value": null,
"name": "ecs.capability.execution-role-awslogs"
},
{
"targetId": null,
"targetType": null,
"value": null,
"name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
},
{
"targetId": null,
"targetType": null,
"value": null,
"name": "com.amazonaws.ecs.capability.task-iam-role"
},
{
"targetId": null,
"targetType": null,
"value": null,
"name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
},
{
"targetId": null,
"targetType": null,
"value": null,
"name": "ecs.capability.task-eni"
}
],
"pidMode": null,
"requiresCompatibilities": [
"FARGATE"
],
"networkMode": "awsvpc",
"runtimePlatform": null,
"cpu": "256",
"revision": 6,
"status": "ACTIVE",
"inferenceAccelerators": null,
"proxyConfiguration": null,
"volumes": []
}
```
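As an aside on how `entryPoint` and `command` combine: the effective container process is the concatenation of the two arrays (this mirrors Docker's ENTRYPOINT/CMD semantics). Here is a minimal TypeScript sketch of that combination using the values from the task definition above; note the original escapes commas via `printf "\x2c"`, presumably because the value would otherwise be split on commas.

```ts
// Sketch: ECS (like Docker) runs entryPoint concatenated with command.
const entryPoint = ['sh', '-c'];
const command = [
  'socat SYSTEM:"echo hello; cat" TCP4-LISTEN:1314,FORK' +
    ' & socat PIPE TCP4-LISTEN:1315,FORK' +
    ' & socat PIPE UDP4-LISTEN:1316,FORK',
];

// The effective argv inside the container:
const argv = [...entryPoint, ...command];
console.log(argv);
// [ 'sh', '-c', 'socat SYSTEM:"echo hello; cat" TCP4-LISTEN:1314,FORK & ...' ]
```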
So now we should be able to set up PK without any of the bells and whistles, similar to http-demo.
Success! We have a running Polykey instance on AWS, with a successful unattended bootstrap to create all of its node state. There were a few things we've done to get this working: …
Success! Was able to contact our spun-up remote keynode on AWS by setting the client host, client port, and node ID.
Initially I was having trouble with this because I had forgotten to include the node ID in the CLI arguments. Without remembering that we needed a node ID as part of the verification step, it was tricky to debug, because no error was thrown (despite supplying the client host and client port): …
Yep, confirmed it is working as well; plus the session token gets saved in your local node path.
So the problem is that if you specify only some of the client options… which means when I run the command, I actually get an exception, but this is because it's complaining that it cannot "resolve" the missing value. So if there is a status file, and we try to use it from there, we should get a proper error instead of just: …
Confirmed the above was the case. After deleting my local keynode state, I was able to reproduce the expected error that @CMCDragonkai received when not specifying the node ID.
I'm going to change the…
So there are 2 problems with how the client options are processed.

The first is that we want the user to understand why we wanted to read the status when one of the parameters is not supplied. So to solve this we need an exception message saying that we read the status because of missing parameters, which can be one or more of `nodeId`, `clientHost`, and `clientPort`. These 3 parameters should be especially noted as the parameters controlling connection to an agent.

The second problem is that when the status does exist, it is read, but only missing parameters are filled with status values. Somehow this leads to saying that the status is dead. Instead, what should happen is that it should try contacting the agent with the potentially wrong node ID.

```ts
if (statusInfo.status === 'LIVE') {
if (nodeId == null) nodeId = statusInfo.data.nodeId;
if (clientHost == null) clientHost = statusInfo.data.clientHost;
if (clientPort == null) clientPort = statusInfo.data.clientPort;
return {
statusInfo,
status,
nodeId,
clientHost,
clientPort,
};
}
```

It appears this is because it only overrides the parameters if the status is `LIVE`.
The first change we can do is to override the parameters even if the status is not LIVE. However, that's not enough to solve the problem. What we could do is: if any of the parameters are unset, we override them with what's read in the status and then return with whatever we have. Here are the cases: …
If we cannot contact the agent, then we have to indicate a client connection exception or a status exception, but provide reasoning as to why. Part of it is that, no matter what, we couldn't fulfil all 3 required parameters. This should only occur in case 6, and in case 3 if there's a non-LIVE status (or a bug in our program). See the sketch below.
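A minimal sketch of the proposed behaviour, assuming a hypothetical `processClientOptions` helper (the real function may differ; `ErrorCLIClientOptions` is mentioned later in this thread):

```ts
// Stand-ins for the codebase's opaque types.
type NodeId = string;
type Host = string;
type Port = number;

type StatusInfo = {
  status: 'LIVE' | 'STARTING' | 'STOPPING' | 'DEAD';
  data: { nodeId?: NodeId; clientHost?: Host; clientPort?: Port };
};

class ErrorCLIClientOptions extends Error {}

// Fill missing parameters from the status even when it is not LIVE,
// and explain *why* the status was read if they still can't be met.
function processClientOptions(
  statusInfo?: StatusInfo,
  nodeId?: NodeId,
  clientHost?: Host,
  clientPort?: Port,
): { nodeId: NodeId; clientHost: Host; clientPort: Port } {
  const missing: Array<string> = [];
  if (nodeId == null) missing.push('nodeId');
  if (clientHost == null) missing.push('clientHost');
  if (clientPort == null) missing.push('clientPort');
  nodeId ??= statusInfo?.data.nodeId;
  clientHost ??= statusInfo?.data.clientHost;
  clientPort ??= statusInfo?.data.clientPort;
  if (nodeId == null || clientHost == null || clientPort == null) {
    throw new ErrorCLIClientOptions(
      `Read the status because ${missing.join(', ')} were not supplied, ` +
        `but the parameters still could not be fulfilled`,
    );
  }
  // Otherwise attempt the connection, even with a potentially wrong
  // node ID; a failed connection should surface its own error.
  return { nodeId, clientHost, clientPort };
}
```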
To solve this, we can make use of the `StatusLive` type. Right now this is:

```ts
type StatusLive = {
status: 'LIVE';
data: {
pid: number;
nodeId: NodeId;
clientHost: Host;
clientPort: Port;
ingressHost: Host;
ingressPort: Port;
[key: string]: any;
};
};
```

Remember that the status output's `data` looks like:

```ts
data: {
status: 'LIVE',
pid,
nodeId,
clientHost,
clientPort,
ingressHost,
ingressPort,
egressHost,
egressPort,
agentHost,
agentPort,
proxyHost,
proxyPort,
rootPublicKeyPem,
rootCertPem,
rootCertChainPem,
},
```

We can say both should output this. This means we can add the additional properties here to our `StatusLive` type. Note that the hosts and ports do not change during the lifetime of the program, but when the key pair changes, that affects the `nodeId` and the root certificate properties.
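A sketch of the extended type, with the additional properties mirrored from the status output above (the property types are assumptions based on the existing `StatusLive`):

```ts
// NodeId, Host, Port are the codebase's existing types.
type StatusLive = {
  status: 'LIVE';
  data: {
    pid: number;
    nodeId: NodeId;
    clientHost: Host;
    clientPort: Port;
    ingressHost: Host;
    ingressPort: Port;
    egressHost: Host;
    egressPort: Port;
    agentHost: Host;
    agentPort: Port;
    proxyHost: Host;
    proxyPort: Port;
    rootPublicKeyPem: string;
    rootCertPem: string;
    rootCertChainPem: string;
    [key: string]: any;
  };
};
```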
The root key pair change callback is what would propagate these changes. However this is only used in a few places right now, and the certificate itself may also be changed independently of the key pair. So the type could be:

```ts
type RootKeyPairChangeData = {
nodeId: NodeId;
rootKeyPair: KeyPair;
rootCert: Certificate;
rootCertChain: Array<CertificatePem>;
recoveryCode: RecoveryCode;
}
```

Note that if the keypair has changed, then the recovery code changes too, right? So we need to add this in as well!! @emmacasolin I would also change the name of this to be a callback that is called whenever there's a change to the root key state: `const eventRootKeyPairChange = Symbol('rootKeyPairChange');`. We can use…
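A sketch of how this could be wired up, with a plain Node `EventEmitter` standing in for whatever event bus we use (the subscriber and emitter sides are assumptions; only the symbol comes from the comment above):

```ts
import { EventEmitter } from 'events';

// RootKeyPairChangeData and the key/cert types are as defined above.
const eventRootKeyPairChange = Symbol('rootKeyPairChange');
const eventBus = new EventEmitter();

// Subscriber side: e.g. Status flushes the new values to the status file.
eventBus.on(eventRootKeyPairChange, (data: RootKeyPairChangeData) => {
  // Update the status file with data.nodeId, data.rootCertChain,
  // data.recoveryCode, etc.
});

// Emitter side: KeyManager, after renewing or resetting the root key pair.
declare const changeData: RootKeyPairChangeData; // from the codebase
eventBus.emit(eventRootKeyPairChange, changeData);
```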
@joshuakarp when the root key pair changes… however, how does it get access to the new key? Is it calling the…
It may be better to just pay the cost to start the generalisation process now then. At least for the…
A problem I can see with doing this: I remember running into a similar problem when I was working on the Discovery Queue, which is one of the reasons why that queue ended up being entirely contained within the Discovery domain rather than being a separate class. Actually, if the queue class gets passed in during…
This should easily be resolved by always passing in arrow functions. Arrow functions will encapsulate the `this` context. So imagine:
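For example, a self-contained sketch of the pitfall and the fix (the class and method names here are illustrative only):

```ts
// A generic queue that stores thunks to run later.
const queue: Array<() => Promise<void>> = [];

class NodeManager {
  protected name = 'nodes';
  public async doWork(): Promise<void> {
    console.log(`working in ${this.name}`);
  }
  public enqueueWrong(): void {
    // BUG: detaches the method from its instance; `this` will be
    // undefined when the queue later invokes it.
    queue.push(this.doWork);
  }
  public enqueueRight(): void {
    // The arrow function closes over `this`, preserving the binding.
    queue.push(() => this.doWork());
  }
}

const manager = new NodeManager();
manager.enqueueRight();
for (const task of queue) void task(); // logs: working in nodes
```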
`NodeManager.setNode` and `NodeConnectionManager.syncNodeGraph` now utilise a single, shared queue to asynchronously add nodes to the node graph without blocking the main loop. These methods are both blocking by default but can be made non-blocking by setting the `block` parameter to false. #322
The generic queue class…
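For reference, a minimal sketch of what such a generic queue could look like (an assumed shape, not the PR's actual implementation):

```ts
// A single background loop drains tasks strictly in order; pushes can
// either await completion (blocking) or return immediately.
class Queue {
  protected tasks: Array<() => Promise<void>> = [];
  protected draining = false;

  public async push(task: () => Promise<void>, block = true): Promise<void> {
    if (block) {
      const done = new Promise<void>((resolve, reject) => {
        this.tasks.push(() => task().then(resolve, reject));
      });
      void this.drain();
      await done;
    } else {
      this.tasks.push(task);
      void this.drain();
    }
  }

  protected async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    let next: (() => Promise<void>) | undefined;
    while ((next = this.tasks.shift()) != null) {
      try {
        await next();
      } catch {
        // Non-blocking tasks swallow errors here; blocking callers
        // receive them through their own promise above.
      }
    }
    this.draining = false;
  }
}
```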
What is happening to these?
I currently have them stubbed out. They were commented out, but I don't have the story on why. Currently they make up about two-thirds of the test errors.
Tests are passing in CICD now.
Just a small refactor. I've renamed some methods, since `queueStart` and `queuePush` are unnecessarily verbose now that Queue is its own class. Also simplified some logic using the `promise` utility.
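For context, the `promise` utility referred to here is presumably a deferred-promise helper along these lines (a sketch; the actual utility may differ):

```ts
// Deconstructed promise: exposes resolve/reject alongside the promise
// so one piece of code can await `p` while another settles it.
type PromiseDeconstructed<T> = {
  p: Promise<T>;
  resolveP: (value: T | PromiseLike<T>) => void;
  rejectP: (reason?: any) => void;
};

function promise<T = void>(): PromiseDeconstructed<T> {
  let resolveP!: (value: T | PromiseLike<T>) => void;
  let rejectP!: (reason?: any) => void;
  const p = new Promise<T>((resolve, reject) => {
    resolveP = resolve;
    rejectP = reject;
  });
  return { p, resolveP, rejectP };
}

// Usage: block until an event fires elsewhere.
const { p, resolveP } = promise<void>();
setTimeout(() => resolveP(), 100);
await p;
```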
This contains fixes for failing tests as well as fixes for tests failing to exit when finished.
This checks if we await things that are not promises. This is not a problem per se, but we generally don't want to await random things.
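For example, this is the kind of pattern such a check (e.g. ESLint's `@typescript-eslint/await-thenable` rule) would flag:

```ts
function getCount(): number {
  return 42;
}

async function main(): Promise<void> {
  // Flagged: getCount() is not a thenable, so `await` does nothing
  // useful and usually signals a missing async call.
  const count = await getCount();

  // Fine: awaiting an actual promise.
  const delayed = await Promise.resolve(getCount());
  console.log(count, delayed);
}

void main();
```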
Force-pushed from `3e5e82c` to `08beafc`.
…es when entering the network. This tests whether the seed node contains the new nodes when they are created. It also checks that the new nodes discover each other after being created. Includes a change to `findNode`: it will no longer throw an error when failing to find the node; this will have to be thrown by the caller now. This was required by `refreshBucket`, since it's very likely that we can't find the random node it is looking for.
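Under that change, a caller that genuinely requires the node would re-introduce the error itself, roughly like this (hypothetical names and return type; only the behavioural change comes from the commit message above):

```ts
// Stand-ins for the codebase's types and objects.
type NodeId = string;
type NodeAddress = { host: string; port: number };
declare const nodeConnectionManager: {
  findNode(nodeId: NodeId): Promise<NodeAddress | undefined>;
};

// findNode no longer throws when the node isn't found; the caller
// decides whether that is an error.
async function requireNode(nodeId: NodeId): Promise<NodeAddress> {
  const address = await nodeConnectionManager.findNode(nodeId);
  if (address == null) {
    throw new Error(`Node ${nodeId} could not be found`);
  }
  return address;
}

// By contrast, refreshBucket simply tolerates a miss, since the random
// NodeId it searches for is unlikely to exist.
```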
Force-pushed from `882d66b` to `6b75ad0`.
Some tests are failing randomly. Need to look out for them and take note.
Are they the usual suspects: …? Or are they timing problems, like key generation taking too long?
I don't have many details on it right now. I've seen gestalts failing from time to time, but also some other domains are failing in CICD that I have to re-trigger. I'll have to look deeper into it when I get a chance.
…`nodesClosestLocalNode` and `nodesHolePunchMessage`
This PR has been migrated to #378.
Description
Once #310 has been merged, we'll finally be able to move on to the deployment of our testnet into AWS. I foresee this being achievable in a few stages: …
Issues Fixed

- Fixes Support host (IP address) in `parseSeedNodes` #324 - being completed in Extracting Node Connection Management out of `NodeManager` to `NodeConnectionManager` #310
- Testnet deployment:
  - NodeGraph Structure:
    - `NodeGraph` bucket operations #244
  - Node Adding Policies:
    - `NodeGraph` #322
    - `NodeGraph` #344
    - `NodeConnectionManager` methods #363
  - Node Removal Policies:
    - `NodeGraph` #150
    - `NodeGraph` buckets #345

Tasks
- [ ] 2. Complete Support host (IP address) in `parseSeedNodes` #324 - being completed in Extracting Node Connection Management out of `NodeManager` to `NodeConnectionManager` #310
- [ ] 6. Create automated testing that utilises the testnet - to be done in Tests for NAT-Traversal and Hole-Punching #357
  - [ ] NAT-Traversal Testing with testnet.polykey.io #159
  - [ ] Create automated tests for establishing connection via hole-punch signalling message #161
- Fix the `nodes` tests - Testnet Node Deployment (testnet.polykey.io) #194 (comment): `nodesChainDataGet`, `nodesClosestLocalNode`, `nodesHolePunchMessage`
- Change `Cmd` to `EntryPoint`, because PK only has 1 executable and it makes `docker run ...` easier.
- `--seed-nodes='<defaults>;...'` will now mean that any specified seed nodes override the defaults rather than the other way around
- When a node list includes our own `NodeID`, the own `NodeId` will be automatically filtered.
- Make `pk agent start` return status information such as `nodeId` and not just `recoveryCode`
- When using the client (`agent start` and the like), it needs to be made more obvious whether we are contacting the local agent or a remote agent - Testnet Deployment #326 (comment) - `ErrorCLIClientOptions`.
- `NodeGraph`: see issue Seed node not adding details of connecting node to its `NodeGraph` #344 - updating the `NodeGraph` for a connecting node
  - [ ] a. No output provided on `pk identities trust`? - moved this to https://github.com/MatrixAI/js-polykey/issues/334#issuecomment-1043779027. However, I understand that we've previously shied away from providing "success" output on every single one of our commands. It can be tricky to diagnose problems when this is the case though.
  - [ ] b. `pk identities trust` seemingly not adding a node to the gestalt graph when it doesn't already exist in `GestaltGraph`/`NodeGraph` - moved this to https://github.com/MatrixAI/js-polykey/issues/334#issuecomment-1043779027 - see Testnet Deployment #326 (comment) and https://matrixai.slack.com/archives/CEAUUV5QX/p1645069875382019
- `SyntaxError: Unexpected end of data` on an "invalid" node ID when using `decodeNodeId` - rebased on master, which has the new js-id 3.3.1 that catches syntax errors when decoding multibase encoded strings
- New `NodeGraph` structure - `NodeGraph` bucket operations #244:
  - `buckets` sublevel will contain each bucket, where each bucket sublevel contains `NodeId` to `NodeData`
  - `meta` sublevel contains each bucket, where each bucket sublevel is a config structure, currently only `count`
  - `lastUpdated` or `index` sublevel that contains each bucket, and each bucket sublevel contains `lexi(lastUpdated)-NodeId` to `NodeId`; the key is a compound index, allowing us to efficiently acquire the most up-to-date or least up-to-date node entry
  - `getNodes` and `getBuckets` will allow us to interrogate the state of the `NodeGraph` efficiently, by streaming data out of the `NodeGraph`; this will be important for debugging, and later analytics
  - uses `DBTransaction` and the new iterator feature of sublevels that allow us to maintain a snapshot of the DB at a point in time
- New `getClosestNode` method based on the new `NodeGraph` structure
- `NodeGraph` and `NodeManager` should be tightly coupled due to the adding and removal policies of nodes; maintaining the bucket limit must involve pinging old nodes

Final checklist