Manta v1 Operator Guide

Manta (v1) is an internet-facing object store with in-situ Unix-based compute as a first class operation. The user interface to Manta is essentially:

  • A filesystem-like namespace, with directories and objects, accessible over HTTP
  • Objects are arbitrary-size blobs of data
  • Users can use standard HTTP PUT/GET/DELETE verbs to create and remove directories and objects as well as to list directories
  • Users can fetch arbitrary ranges of an object, but may not modify an object except by replacing it
  • Users submit map-reduce compute jobs that run arbitrary Unix programs on their objects.

Users can interact with Manta through the official Node.js CLI; the Triton user portal; the Node, Python, Ruby, or Java SDKs; curl(1); or any web browser.

For more information, see the official public user documentation. Before reading this document, you should be very familiar with using Manta, including both the CLI tools and the compute jobs features. You should also be comfortable with all the reference material on how the system works from a user's perspective.

(Note: This is the operator guide for Manta v1. See this document for information on mantav1 vs mantav2. If you are using mantav2, please see the Mantav2 Operator Guide.)

Architecture basics

Design constraints

Horizontal scalability. It must be possible to add more hardware to scale any component within Manta without downtime. As a result of this constraint, there are multiple instances of each service.

Strong consistency. In the face of network partitions where it's not possible to remain both consistent and available, Manta chooses consistency. So if all three datacenters in a three-DC deployment become partitioned from one another, requests may fail rather than serve potentially incorrect data.

High availability. Manta must survive failure of any service, physical server, rack, or even an entire datacenter, assuming it's been deployed appropriately. Development installs of Manta can fit on a single system, and obviously those don't survive server failure, but several production deployments span three datacenters and survive partitioning or failure of an entire datacenter without downtime for the other two.

Basic terminology

We use nodes to refer to physical servers. Compute nodes mean the same thing they do in Triton: any physical server that's not a head node. Storage nodes are compute nodes that are designated to store actual Manta objects. These are the same servers that run users' compute jobs, but we don't call those compute nodes because that would be confusing with the Triton terminology.

A Manta install uses:

  • a headnode (see "Manta and Triton" below)
  • one or more storage nodes to store user objects and run compute jobs
  • one or more non-storage compute nodes for the other Manta services.

We use the term datacenter (or DC) to refer to an availability zone (or AZ). Each datacenter represents a single Triton deployment (see below). Manta supports being deployed in either 1 or 3 datacenters within a single region, which is a group of datacenters having a high-bandwidth, low-latency network connection.

Manta and Triton (SDC)

Manta is built atop Triton (formerly known as SmartDataCenter). A three-datacenter deployment of Manta is built atop three separate Triton deployments. The presence of Manta does not change the way Triton is deployed or operated. Administrators still have access to AdminUI and the Triton APIs, and they're still responsible for managing the Triton services, platform versions, and the like through the normal Triton mechanisms.

Components of Manta

All user-facing Manta functionality can be divided into a few major subsystems:

  • The storage tier is responsible for storing the physical copies of user objects on disk. Storage nodes store objects as files with random uuids. So within each storage node, the objects themselves are effectively just large, opaque blobs of data.
  • The metadata tier is responsible for storing metadata about each object that's visible from the public Manta API. This metadata includes the set of storage nodes on which the physical copy is stored.
  • The jobs subsystem (also called Marlin) is responsible for executing user programs on the objects stored in the storage tier.

In order to make all this work, there are several other pieces:

  • The front door is made up of the SSL terminators, load balancers, and API servers that actually handle user HTTP requests. All user interaction with Manta happens over HTTP (even compute jobs), so the front door handles all user-facing operations.
  • An authentication cache maintains a read-only copy of the UFDS account database. All front door requests are authenticated against this cache.
  • A garbage collection and auditing system periodically compares the contents of the metadata tier with the contents of the storage tier to identify deleted objects, remove them, and verify that all other objects are replicated as they should be.
  • An accelerated garbage collection system that leverages some additional constraints on object uploads to allow for faster space reclamation.
  • A metering system periodically processes log files generated by the rest of the system to produce reports that are ultimately turned into invoices.
  • A couple of dashboards provide visibility into what the system is doing at any given point.
  • A consensus layer is used to keep track of primary-secondary relationships in the metadata tier.
  • DNS-based nameservices are used to keep track of all instances of all services in the system.

Services, instances, and agents

Just like with Triton, components are divided into services, instances, and agents. Services and instances are SAPI concepts.

A service is a group of instances of the same kind. For example, "jobsupervisor" is a service, and there may be multiple jobsupervisor zones. Each zone is an instance of the "jobsupervisor" service. The vast majority of Manta components are service instances, and there are several different services involved.

Agents are components that run in the global zone. Manta uses one agent on each storage node called the marlin agent in order to manage the execution of user compute jobs on each storage node.

Note: Do not confuse SAPI services with SMF services. We're talking about SAPI services here. A given SAPI instance (which is a zone) may have many SMF services.

Manta components at a glance

| Kind    | Major subsystem | Service          | Purpose                               | Components              |
| ------- | --------------- | ---------------- | ------------------------------------- | ----------------------- |
| Service | Consensus       | nameservice      | Service discovery                     | ZooKeeper, binder (DNS) |
| Service | Front door      | loadbalancer     | SSL termination and load balancing    | stud, haproxy/muppet    |
| Service | Front door      | webapi           | Manta HTTP API server                 | muskie                  |
| Service | Front door      | authcache        | Authentication cache                  | mahi (redis)            |
| Service | Metadata        | postgres         | Metadata storage and replication      | postgres, manatee       |
| Service | Metadata        | moray            | Key-value store                       | moray                   |
| Service | Metadata        | electric-moray   | Consistent hashing (sharding)         | electric-moray          |
| Service | Storage         | storage          | Object storage and capacity reporting | mako (nginx), minnow    |
| Service | Operations      | ops              | GC, audit, and metering cron jobs     | mola, mackerel          |
| Service | Operations      | madtom           | Web-based Manta monitoring            | madtom                  |
| Service | Operations      | marlin-dashboard | Web-based Marlin monitoring           | marlin-dashboard        |
| Service | Compute         | jobsupervisor    | Distributed job orchestration         | jobsupervisor           |
| Service | Compute         | jobpuller        | Job archival                          | wrasse                  |
| Agent   | Compute         | marlin-agent     | Job execution on each storage node    | marlin-agent            |
| Agent   | Compute         | medusa           | Interactive Session Engine            | medusa                  |

Consensus and internal service discovery

In some sense, the heart of Manta (and Triton) is a service discovery mechanism (based on ZooKeeper) for keeping track of which service instances are running. In a nutshell, this works like this:

  1. Setup: There are 3-5 "nameservice" zones deployed that form a ZooKeeper cluster. There's a "binder" DNS server in each of these zones that serves DNS requests based on the contents of the ZooKeeper data store.
  2. Setup: When other zones are deployed, part of their configuration includes the IP addresses of the nameservice zones. These DNS servers are the only components that each zone knows about directly.
  3. When an instance starts up (e.g., a "moray" zone), an SMF service called the registrar connects to the ZooKeeper cluster (using the IP addresses configured with the zone) and publishes its own IP to ZooKeeper. A moray zone for shard 1 in region "us-central" publishes its own IP under "1.moray.us-central.mnx.io".
  4. When a client wants to contact the shard 1 moray, it makes a DNS request for 1.moray.us-central.mnx.io using the DNS servers in the ZooKeeper cluster. Those DNS servers return all IPs that have been published for 1.moray.us-central.mnx.io.
  5. If the registrar in the 1.moray zone dies, the corresponding entry in the ZooKeeper data store is automatically removed, causing that zone to fall out of DNS. Due to DNS TTLs of 60s, it may take up to a minute for clients to notice that the zone is gone.
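
To see this in action, you can resolve one of these names yourself from inside any deployed Manta zone, since every zone already uses the nameservice zones as its DNS resolvers. This is a hedged example; the addresses shown are purely illustrative:

$ dig +short 1.moray.us-central.mnx.io
10.10.0.21
10.10.0.22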

Internally, most services work this way.

External service discovery

Since we don't control Manta clients, the external service discovery system is simpler and more static. We manually configure the public us-central.manta.mnx.io DNS name to resolve to each of the loadbalancer public IP addresses. After a request reaches the loadbalancers, everything uses the internal service discovery mechanism described above to contact whatever other services they need.

Storage tier

The storage tier is made up of Mantis Shrimp nodes. Besides having a great deal of physical storage in order to store users' objects, these systems have lots of DRAM in order to support a large number of marlin compute zones, where we run user programs directly on their data.

Each storage node has an instance of the storage service, also called a "mako" or "shark" (as in: a shard of the storage tier). Inside this zone runs:

  • mako: an nginx instance that supports simple PUT/GET for objects. This is not the front door; this is used internally to store each copy of a user object. Objects are stored in a ZFS delegated dataset inside the storage zone, under /manta/$account_uuid/$object_uuid (see the example after this list).
  • minnow: a small Node service that periodically reports storage capacity data into the metadata tier so that the front door knows how much capacity each storage node has.
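
For example, you can confirm this on-disk layout from the headnode with something like the following (a hedged sketch; each storage zone should list one directory per account uuid, with object files named by object uuid inside):

headnode$ manta-oneach -s storage 'ls /manta | head -4'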

In addition to the "storage" zone, each storage node has some number of marlin zones (or compute zones). These are essentially blank zones in which we run user programs. We currently configure 128 of these zones on each storage node.

Metadata tier

The metadata tier is itself made up of three levels:

  • "postgres" zones, which run instances of the postgresql database
  • "moray" zones, which run a key-value store on top of postgres
  • "electric-moray" zones, which handle sharding of moray requests

Postgres, replication, and sharding

All object metadata is stored in PostgreSQL databases. Metadata is keyed on the object's name, and the value is a JSON document describing properties of the object including what storage nodes it's stored on.
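
As a rough illustration only (the field names here are approximate, not authoritative), an object's metadata record is shaped something like this:

{
    "dirname": "/mark/stor",
    "key": "/mark/stor/foo",
    "owner": "<account uuid>",
    "objectId": "<object uuid>",
    "type": "object",
    "contentLength": 1048576,
    "sharks": [
        { "datacenter": "us-central-1", "manta_storage_id": "1.stor.us-central.mnx.io" },
        { "datacenter": "us-central-2", "manta_storage_id": "2.stor.us-central.mnx.io" }
    ]
}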

This part is particularly complicated, so pay attention! The metadata tier is replicated for availability and sharded for scalability.

It's easiest to think of sharding first. Sharding means dividing the entire namespace into one or more shards in order to scale horizontally. So instead of storing objects A-Z in a single postgres database, we might choose two shards (A-M in shard 1, N-Z in shard 2), or three shards (A-I in shard 1, J-R in shard 2, S-Z in shard 3), and so on. Each shard is completely separate from the others. They don't overlap at all in the data that they store. The shard responsible for a given object is determined by consistent hashing on the directory name of the object. So the shard for "/mark/stor/foo" is determined by hashing "/mark/stor".

Within each shard, we use multiple postgres instances for high availability. At any given time, there's a primary peer (also called the "master"), a secondary peer (also called the "synchronous slave"), and an async peer (sometimes called the "asynchronous slave"). As the names suggest, we configure synchronous replication between the primary and secondary, and asynchronous replication between the secondary and the async peer. Synchronous replication means that transactions must be committed on both the primary and the secondary before the write is acknowledged to the client. Asynchronous replication means that the asynchronous peer may be slightly behind the other two.

The idea with configuring replication in this way is that if the primary crashes, we take several steps to recover:

  1. The shard is immediately marked read-only.
  2. The secondary is promoted to the primary.
  3. The async peer is promoted to the secondary. With the shard being read-only, it should quickly catch up.
  4. Once the async peer catches up, the shard is marked read-write again.
  5. When the former primary comes back online, it becomes the asynchronous peer.

This allows us to quickly restore read-write service in the event of a postgres crash or an OS crash on the system hosting the primary. This process is managed by the "manatee" component, which uses ZooKeeper for leader election to determine which postgres will be the primary at any given time.
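
You can inspect the current replication topology for each shard from the headnode. This is a hedged example; manatee-adm is the administrative tool shipped in the postgres zones, and its output identifies the primary, sync, and async peers for the zone's shard:

headnode$ manta-oneach -s postgres 'manatee-adm show'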

It's really important to keep straight the difference between sharding and replication. Even though replication means that we have multiple postgres instances in each shard, only the primary can be used for read/write operations, so we're still limited by the capacity of a single postgres instance. That's why we have multiple shards.

Other shards

There are actually three kinds of metadata in Manta:

  • Object metadata, which is sharded as described above. This may be medium to high volume, depending on load.
  • Storage node capacity metadata, which is reported by "minnow" instances (see above) and all lives on one shard. This is extremely low-volume: a couple of writes per storage node per minute.
  • Compute job state, which is all stored on a single shard. This is extremely high volume, depending on job load.

In the us-central production deployment, shard 1 stores compute job state and storage node capacity. Shards 2-4 store the object metadata.

Manta supports resharding object metadata, which would typically be used to add an additional shard (for additional capacity). This operation has never been needed (or used) in production. Assuming the service is successful, that's likely just a matter of time.

Moray

For each metadata shard (which we said above consists of three PostgreSQL databases), there are two or more "moray" instances. Moray is a key-value store built on top of postgres. Clients never talk to postgres directly; they always talk to Moray. (Actually, they generally talk to electric-moray, which proxies requests to Moray. See below.) Moray keeps track of the replication topology (which Postgres instance is the primary, which is the secondary, and which is the async) and directs all read/write requests to the primary Postgres instance. This way, clients don't need to know about the replication topology.

Like Postgres, each Moray instance is tied to a particular shard. These are typically referred to as "1.moray", "2.moray", and so on.

Electric-moray

The electric-moray service sits in front of the sharded Moray instances and directs requests to the appropriate shard. So if you try to update or fetch the metadata for /mark/stor/foo, electric-moray will hash /mark/stor to find the right shard and then proxy the request to one of the Moray instances operating that shard.

The front door

The front door consists of "loadbalancer" and "webapi" zones.

"loadbalancer" zones actually run both stud (for SSL termination) and haproxy (for load balancing across the available "webapi" instances). "haproxy" is managed by a component called "muppet" that uses the DNS-based service discovery mechanism to keep haproxy's list of backends up-to-date.

"webapi" zones run the Manta-specific API server, called muskie. Muskie handles PUT/GET/DELETE requests to the front door, including requests to:

  • create and delete objects
  • create, list, and delete directories
  • create compute jobs, submit input, end input, fetch inputs, fetch outputs, fetch errors, and cancel jobs
  • create multipart uploads, upload parts, fetch multipart upload state, commit multipart uploads, and abort multipart uploads

Objects and directories

Requests for objects and directories involve:

  • validating the request
  • authenticating the user (via mahi, the auth cache)
  • looking up the requested object's metadata (via electric moray)
  • authorizing the user for access to the specified resource

For requests on directories and zero-byte objects, the last step is to update or return the right metadata.

For write requests on objects, muskie then:

  • Constructs a set of candidate storage nodes that will be used to store the object's data, where each storage node is located in a different datacenter (in a multi-DC configuration). By default, there are two copies of the data, but users can configure this by setting the durability level with the request.
  • Tries to issue a PUT with 100-continue to each of the storage nodes in the candidate set. If that fails, try another set. If all sets are exhausted, fail with 503.
  • Once the 100-continue is received from all storage nodes, the user's data is streamed to all nodes. Upon completion, there should be a 204 response from each storage node.
  • Once the data is safely written to all nodes, the metadata tier is updated (using a PUT to electric-moray), and a 204 is returned to the client. At this point, the object's data is recorded persistently on the requested number of storage nodes, and the metadata is replicated on at least two index nodes.

For read requests on objects, muskie instead contacts each of the storage nodes hosting the data and streams data from whichever one responds first to the client.
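
For example, a user who wants three copies of an object instead of the default two can ask for that at write time. This is a hedged example using the Node.js CLI's copies option:

$ mput -c 3 -f /tmp/hello.txt ~~/stor/hello-three-copies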

Compute jobs

Requests to manipulate compute jobs generally translate into creating or listing job-related Moray records:

  • When the user submits a request to create a job, muskie creates a new job record in Moray.
  • When the user submits a request to add an input, muskie creates a new job input record in Moray.
  • When the user submits a request to cancel a job or end input, muskie modifies the job record in Moray.
  • When the user lists inputs, outputs, or errors, muskie lists job input records, task output records, or error records.

All of these requests operate on the single shard that stores the compute job state. These requests do not go through electric-moray.

Compute tier (a.k.a., Marlin)

There are three core components of Marlin:

  • A small fleet of supervisors manages the execution of jobs. (Supervisors used to be called workers, and you may still see that terminology). Supervisors pick up new inputs, locate the servers where the input objects are stored, issue tasks to execute on those servers, monitor the execution of those tasks, and decide when the job is done. Each supervisor can manage many jobs, but each job is managed by only one supervisor at a time.
  • Job tasks execute directly on the Manta storage nodes. Each node has an agent (i.e., the "marlin agent") running in the global zone that manages tasks assigned to that node and the zones available for running those tasks.
  • Within each compute zone, task execution is managed by a lackey under the control of the agent.

All of Marlin's state is stored in Moray. A few other components are involved in executing jobs:

  • Muskie handles all user requests related to jobs: creating jobs, submitting input, and fetching status, outputs, and errors. To create jobs and submit inputs, Muskie creates and updates records in Moray. See above for details.
  • Wrasse (the jobpuller) is a separate component that periodically scans for recently completed jobs, archives the saved state into flat objects back in Manta, and then removes job state from Moray. This is critical to keep the database that manages job state from growing forever.

Distributed execution

When the user runs a "map" job, Muskie receives client requests to create the job, to add inputs to the job, and to indicate that there will be no more job inputs. Jobsupervisors compete to take newly assigned jobs, and exactly one will win and become responsible for orchestrating the execution of the job. As inputs are added, the supervisor resolves each object's name to the internal uuid that identifies the object and checks whether the user is allowed to access that object. Assuming the user is authorized for that object, the supervisor locates all copies of the object in the fleet, selects one, and issues a task to an agent running on the server where that copy is stored. This process is repeated for each input object, distributing work across the fleet.

The agent on the storage server accepts the task and runs the user's script in an isolated compute zone. It records any outputs emitted as part of executing the script. When the task has finished running, the agent marks it completed. The supervisor commits the completed task, marking its outputs as final job outputs. When there are no more unprocessed inputs and no uncommitted tasks, the supervisor declares the job done.

If a task fails for a retryable reason, it will be retried a few times, preferably on different servers. If it keeps failing, an error is produced.

Multi-phase map jobs are similar except that the outputs of each first-phase map task become inputs to a new second-phase map task, and only the outputs of the second phase become outputs of the job.

Reducers run like mappers, except that a reducer's input is not completely known until the previous phase has completed. Because reducers can read an arbitrary number of inputs, the inputs themselves are dispatched as individual records, and a separate end-of-input must be issued before the reducer can complete.

Local execution

The agent on each storage server maintains a fixed set of compute zones in which user scripts can be run. When a map task arrives, the agent locates the file representing the input object on the local filesystem, finds a free compute zone, maps the object into the local filesystem, and runs the user's script, redirecting stdin from the input file and stdout to a local file. When the script exits, assuming it succeeds, the output file is saved as an object in the object store, recorded as an output from the task, and the task is marked completed. If there is more work to be done for the same job, the agent may choose to run it in the same compute zone without doing anything to clean up after the first one. When there is no more work to do, or the agent decides to repurpose the compute zone for another job, the compute zone is halted, the filesystem rolled back to its pristine state, and the zone is booted again to run the next task. Since the compute zones themselves are isolated from one another and they are fully rebooted and rolled back between jobs, there is no way for users' jobs to see or interfere with other jobs running in the system.

Internal communication within Marlin

The system uses a Moray/PostgreSQL shard for all communication. There are buckets for jobs, job inputs, tasks, task inputs (for reduce tasks), task outputs, and errors. Supervisors and agents poll for new records applicable to them. For example, supervisors poll for tasks assigned to them that have been completed but not committed and agents poll for tasks assigned to them that have been dispatched but not accepted.

Multipart uploads

Multipart uploads provide an alternate way for users to upload Manta objects. The user creates the multipart upload, uploads the object in parts, and exposes the object in Manta by committing the multipart upload. Generally, these operations are implemented using existing Manta constructs:

  • Parts are normal Manta objects, with a few key differences. Users cannot use the GET, POST or DELETE HTTP methods on parts. Additionally, all parts are co-located on the same set of storage nodes, which are selected when the multipart upload is created.
  • All parts for a given multipart upload are stored in a parts directory, which is a normal Manta directory.
  • Part directories are stored in the top-level /$MANTA_USER/uploads directory tree.

Most of the logic for multipart uploads is performed by Muskie, but there are some additional features of the system only used for multipart uploads:

  • the manta_uploads bucket in Moray stores finalizing records for a given shard. A finalizing record is inserted atomically with the target object record when a multipart upload is committed.
  • the mako zones have a custom mako-finalize operation invoked by muskie when a multipart upload is committed. This operation creates the target object from the parts and subsequently deletes the parts from disk. This operation is invoked on all storage nodes that will contain the target object when the multipart upload is committed.

Garbage collection, auditing, and metering

Garbage collection, auditing, and metering all run as cron jobs out of the "ops" zone.

Garbage collection is the process of freeing up storage used for objects which no longer exist. When an object is deleted, muskie records that event in a log and removes the metadata from Moray, but does not actually remove the object from storage servers because there may have been other links to it. The garbage collection job (called "mola") processes these logs, along with dumps of the metadata tier (taken periodically and stored into Manta), and determines which objects can safely be deleted. These delete requests are batched and sent to each storage node, which moves the objects to a "tombstone" area. Objects in the tombstone area are deleted after a fixed interval.
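
You can see what is pending final removal by listing the tombstone area on the storage nodes. This is a hedged example; the exact tombstone path layout (typically one directory per day) is an assumption here:

headnode$ manta-oneach -s storage 'ls /manta/tombstone'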

Multipart upload garbage collection is the process of cleaning up data associated with finalized multipart uploads. When a multipart upload is finalized (committed or aborted), there are several items associated with it that need to be cleaned up, including:

  • its upload directory metadata
  • part object metadata
  • part data on disk (if not removed during the mako-finalize operation)
  • its finalizing metadata

Similar to the basic garbage collection job, there is a multipart upload garbage collection job that operates on dumps of the metadata tier to determine what data associated with multipart uploads can be safely deleted. A separate operation deletes these items after the job has been completed.

Auditing is the process of ensuring that each object is replicated as expected. This is a similar job run over the contents of the metadata tier and manifests reported by the storage nodes.

Metering is the process of measuring how much resource each user used, both for reporting and billing. There's compute metering (how much compute time was used), storage metering (how much storage is used), request metering, and bandwidth metering. These are compute jobs run over the compute logs (produced by the marlin agent), the metadata dumps, and the muskie request logs.

Accelerated Garbage Collection

Accelerated garbage-collection is a mechanism for lower latency storage space reclamation for objects owned by accounts whose data is guaranteed not to be snaplinked. Unlike traditional garbage-collection, accelerated garbage-collection does not rely on the jobs-tier or backups of the metadata tier. However, the actual object file deletion is still done with the cron-scheduled mako_gc.sh script.

Manta's offline garbage-collection system relies on a Manta job that periodically processes database dumps from all shards to locate unreferenced object data and mark it for cleanup. Objects which are found to only be referenced in the manta_delete_log after a system-wide grace-period are marked for deletion. This grace-period is intended to reduce the probability of deleting a snaplinked object. The reason the system processes database dumps from all shards is that, in a traditional setting, a single object may have multiple references on different metadata nodes.

Accelerated garbage-collection processes only objects that are owned by snaplink-disabled accounts. Such objects are guaranteed to have a single metadata reference, and the delete records generated for them are added to the manta_fastdelete_queue as opposed to the manta_delete_log. A "garbage-collector" reads records from the queue and marks the corresponding objects for deletion. Since metadata records are removed from the manta table and added to the manta_fastdelete_queue in the same database transaction, rows in the manta_fastdelete_queue refer to objects that are no longer visible to Manta users.

For Manta users whose data has never been snaplinked, the system provides an option to trade Manta's snaplink functionality for lower-latency storage space reclamation. While accelerated GC can be deployed in conjunction with the traditional offline GC system, it will only process objects owned by snaplink-disabled accounts. Objects owned by other accounts will continue to be processed by the traditional garbage-collection pipeline.

Manta Scalability

There are many dimensions to scalability.

In the metadata tier:

  • number of objects (scalable with additional shards)
  • number of objects in a directory (fixed, currently at a few million objects)

In the storage tier:

  • total size of data (scalable with additional storage servers)
  • size of data per object (limited to the amount of storage on any single system, typically in the tens of terabytes, which is far larger than is typically practical)

In terms of performance:

  • total bytes in or out per second (depends on network configuration)
  • count of concurrent requests (scalable with additional metadata shards or API servers)
  • count of compute tasks executed per second (scalable with additional storage nodes)
  • count of concurrent compute tasks (could be measured in tasks, CPU cores available, or DRAM availability; scaled with additional storage node hardware)

As described above, for most of these dimensions, Manta can be scaled horizontally by deploying more software instances (often on more hardware). For a few of these, the limits are fixed, but we expect them to be high enough for most purposes. For a few others, the limits are not known, and we've never (or rarely) run into them, but we may need to do additional work when we discover where these limits are.

Planning a Manta deployment

Before even starting to deploy Manta, you must decide:

  • the number of datacenters
  • the number of metadata shards
  • the number of storage and non-storage compute nodes
  • how to lay out the non-storage zones across the fleet

Choosing the number of datacenters

You can deploy Manta across any odd number of datacenters in the same region (i.e., having a reliable low-latency, high-bandwidth network connection among all datacenters). We've only tested one- and three-datacenter configurations. Even-numbered configurations are not supported. See "Other configurations" below for details.

A single-datacenter installation can be made to survive server failure, but obviously cannot survive datacenter failure. The us-central deployment uses three datacenters.

Choosing the number of metadata shards

Recall that each metadata shard has the storage and load capacity of a single postgres instance. If you want more capacity than that, you need more shards. Shards can be added later without downtime, but it's a delicate operation. The us-central deployment uses three metadata shards, plus a separate shard for the compute and storage capacity data.

We recommend at least two shards so that the compute and storage capacity information can be fully separated from the remaining shards, which would be used for metadata.

Choosing the number of storage and non-storage compute nodes

The two classes of node (storage nodes and non-storage nodes) usually have different hardware configurations.

The number of storage nodes needed is a function of the expected data footprint and (secondarily) the desired compute capacity.

The number of non-storage nodes required is a function of the expected load on the metadata tier. Since the point of shards is to distribute load, each shard's postgres instance should be on a separate compute node. So you want at least as many compute nodes as you will have shards. The us-central deployment distributes the other services on those same compute nodes.

For information on the latest recommended production hardware, see Triton Manufacturing Matrix and Triton Manufacturing Bill of Materials.

The us-central deployment uses older versions of the Tenderloin-A for service nodes and Mantis Shrimps for storage nodes.

Choosing how to lay out zones

Since there are so many different Manta components, and they're all deployed redundantly, there are a lot of different pieces to think about. So when setting up a Manta deployment, it's very important to think ahead of time about which components will run where!

The manta-adm genconfig tool (when used with the --from-file option) can be very helpful in laying out zones for Manta. See the manta-adm manual page for details. manta-adm genconfig --from-file takes as input a list of physical servers and information about each one. Large deployments that use Device 42 to manage hardware inventory may find the manta-genazconfig tool useful for constructing the input for manta-adm genconfig.

The most important production configurations are described below, but for reference, here are the principles to keep in mind:

  • Storage zones should only be co-located with marlin zones, and only on storage nodes. Neither makes sense without the other, and we do not recommend combining them with other zones. All other zones should be deployed onto non-storage compute nodes.
  • Nameservice: There must be an odd number of "nameservice" zones in order to achieve consensus, and there should be at least three of them to avoid a single point of failure. There must be at least one in each DC to survive any combination of datacenter partitions, and it's recommended that they be balanced across DCs as much as possible.
  • For the non-sharded, non-ops-related zones (which is everything except moray, postgres, ops, madtom, marlin-dashboard), there should be at least two of each kind of zone in the entire deployment (for availability), and they should not be in the same datacenter (in order to survive a datacenter loss). For single-datacenter deployments, they should at least be on separate compute nodes.
  • Only one madtom and marlin-dashboard zone is considered required. It would be good to provide more than one in separate datacenters (or at least separate compute nodes) for maintaining availability in the face of a datacenter failure.
  • There should only be one ops zone. If it's unavailable for any reason, that will only temporarily affect metering, garbage collection, and reports.
  • For postgres, there should be at least three instances in each shard. For multi-datacenter configurations, these instances should reside in different datacenters. For single-datacenter configurations, they should be on different compute nodes. (Postgres instances from different shards can be on the same compute node, though for performance reasons it would be better to avoid that.)
  • For moray, there should be at least two instances per shard in the entire deployment, and these instances should reside on separate compute nodes (and preferably separate datacenters).
  • garbage-collector zones should not be configured to poll more than 6 shards. Further, they should not be co-located with instances of other CPU-intensive Manta components (e.g. loadbalancer) to avoid interference with the data path.

Most of these constraints are required in order to maintain availability in the event of failure of any component, server, or datacenter. Below are some example configurations.

Example single-datacenter, multi-server configuration

On each storage node, you should deploy one "storage" zone. We recommend deploying 128 "marlin" zones for systems with 256GB of DRAM.

If you have N metadata shards, and assuming you'll be deploying 3 postgres instances in each shard, you'd ideally want to spread these over 3N compute nodes. If you combine instances from multiple shards on the same host, you'll defeat the point of splitting those into shards. If you combine instances from the same shard on the same host, you'll defeat the point of using replication for improved availability.

You should deploy at least two Moray instances for each shard onto separate compute nodes. The remaining services can be spread over the compute nodes in whatever way, as long as you avoid putting two of the same thing onto the same compute node. Here's an example with two shards using six compute nodes:

| CN1           | CN2           | CN3            | CN4          | CN5          | CN6            |
| ------------- | ------------- | -------------- | ------------ | ------------ | -------------- |
| postgres 1    | postgres 1    | postgres 1     | postgres 2   | postgres 2   | postgres 2     |
| moray 1       | moray 1       | electric-moray | moray 2      | moray 2      | electric-moray |
| jobsupervisor | jobsupervisor | medusa         | medusa       | authcache    | authcache      |
| nameservice   | nameservice   | nameservice    | webapi       | webapi       | webapi         |
| ops           | marlin-dash   | madtom         | loadbalancer | loadbalancer | loadbalancer   |
| jobpuller     | jobpuller     |                |              |              |                |

In this notation, "postgres 1" and "moray 1" refer to an instance of "postgres" or "moray" for shard 1.

Example three-datacenter configuration

All three datacenters should be in the same region, meaning that they share a reliable, low-latency, high-bandwidth network connection.

On each storage node, you should deploy one "storage" zone. We recommend deploying 128 "marlin" zones for systems with 256GB of DRAM.

As with the single-datacenter configuration, you'll want to spread the postgres instances for N shards across 3N compute nodes, but you'll also want to deploy at least one postgres instance in each datacenter. For four shards, we recommend the following in each datacenter:

| CN1              | CN2           | CN3            | CN4          |
| ---------------- | ------------- | -------------- | ------------ |
| postgres 1       | postgres 2    | postgres 3     | postgres 4   |
| moray 1          | moray 2       | moray 3        | moray 4      |
| nameservice      | nameservice   | electric-moray | authcache    |
| ops              | jobsupervisor | jobsupervisor  | webapi       |
| webapi           | jobpuller     | loadbalancer   | loadbalancer |
| marlin-dashboard | madtom        |                |              |

In this notation, "postgres 1" and "moray 1" refer to an instance of "postgres" or "moray" for shard 1.

Other configurations

For testing purposes, it's fine to deploy all of Manta on a single system. Obviously it won't survive server failure. This is not supported for a production deployment.

It's not supported to run Manta in an even number of datacenters since there would be no way to maintain availability in the face of an even split. More specifically:

  • A two-datacenter configuration is possible but cannot survive datacenter failure or partitioning. That's because the metadata tier would require synchronous replication across two datacenters, which cannot be maintained in the face of any datacenter failure or partition. If we relax the synchronous replication constraint, then data would be lost in the event of a datacenter failure, and we'd also have no way to resolve the split-brain problem where both datacenters accept conflicting writes after the partition.
  • For even numbers N >= 4, we could theoretically survive datacenter failure, but any N/2 -- N/2 split would be unresolvable. You'd likely be better off dividing the same hardware into N - 1 datacenters.

It's not supported to run Manta across multiple datacenters not in the same region (i.e., not having a reliable, low-latency, high-bandwidth connection between all pairs of datacenters).

Deploying Manta

Before you get started for anything other than a COAL or lab deployment, be sure to read and fully understand the section on "Planning a Manta deployment" above.

These general instructions should work for anything from COAL to a multi-DC, multi-compute-node deployment. The general process is:

  1. Set up Triton in each datacenter, including the headnode, all Triton services, and all compute nodes you intend to use. For easier management of hosts, we recommend that the hostname reflect the type of server and, possibly, the intended purpose of the host. For example, we use the "RA" or "RM" prefix for "Richmond-A" hosts and "MS" prefix for "Mantis Shrimp" hosts.

  2. In the global zone of each Triton headnode, set up a manta deployment zone using:

     sdcadm post-setup common-external-nics  # enable downloading service images
     sdcadm post-setup manta --mantav1
    
  3. In each datacenter, generate a Manta networking configuration file.

    a. For COAL, from the GZ, use:

     headnode$ /zones/$(vmadm lookup alias=manta0)/root/opt/smartdc/manta-deployment/networking/gen-coal.sh > /var/tmp/netconfig.json
    

    b. For those using the internal Engineering lab, run this from the lab.git repo:

     lab.git$ node bin/genmanta.js -r RIG_NAME LAB_NAME
    
    and copy that to the headnode GZ.
    

    c. For other deployments, see "Networking configuration" below.

  4. Once you've got the networking configuration file, configure networks by running this in the global zone of each Triton headnode:

     headnode$ ln -s /zones/$(vmadm lookup alias=manta0)/root/opt/smartdc/manta-deployment/networking /var/tmp/networking
     headnode$ cd /var/tmp/networking
     headnode$ ./manta-net.sh CONFIG_FILE
    

    This step is idempotent. Note that if you are setting up a multi-DC Manta, ensure that (1) your Triton networks have cross-datacenter connectivity and routing set up and (2) the Triton firewalls allow TCP and UDP traffic cross-datacenter.

  5. For multi-datacenter deployments, you must link the datacenters within Triton so that the UFDS database is replicated across all three datacenters.

  6. For multi-datacenter deployments, you must configure SAPI for multi-datacenter support.

  7. If you'll be deploying a loadbalancer on any compute nodes other than a headnode, then you'll need to create the "external" NIC tag on those CNs. For common single-system configurations (for dev and test systems), you don't usually need to do anything for this step. For multi-CN configurations, you probably will need to do this. See the Triton documentation for how to add a NIC tag to a CN.

  8. In each datacenter's manta deployment zone, run the following:

     manta$ manta-init -s SIZE -e YOUR_EMAIL
    

    manta-init must not be run concurrently in multiple datacenters.

    SIZE must be one of "coal", "lab", or "production". YOUR_EMAIL is used to create an Amon contact for alarm notifications.

    This step runs various initialization steps, including downloading all of the zone images required to deploy Manta. This can take a while the first time you run it, so you may want to run it in a screen session. It's idempotent.

    A common failure mode for those without quite fast internet links is a failure to import the "manta-marlin" image. The manta-marlin image is the multi-GB image that is used for zones in which Manta compute jobs run. See the "Workaround for manta-marlin image import failure" section below.

  9. In each datacenter, run the following from the headnode global zone to install the "marlin" agent to each server. The marlin agent is a part of the Manta compute jobs system.

    # Download the latest marlin agent image.
    channel=$(sdcadm channel get)
    marlin_agent_image=$(updates-imgadm -C $channel list name=marlin --latest -H -o uuid)
    marlin_agent_file=/var/tmp/marlin-agent-$marlin_agent_image.tgz
    if [[ -f $marlin_agent_file ]]; then
        echo "Already have $marlin_agent_file"
    else
        updates-imgadm -C $channel get-file $marlin_agent_image -o $marlin_agent_file
    fi
    
    # Copy to all other servers.
    sdc-oneachnode -c -X -d /var/tmp -g $marlin_agent_file
    
    # Install the agent on all servers.
    sdc-oneachnode -a "apm install $marlin_agent_file >>/var/log/marlin-agent-install.log 2>&1 && echo success || echo 'fail (see /var/log/marlin-agent-install.log)'"
    
    # Sanity check to ensure marlin agent is running on all storage servers.
    sdc-oneachnode -a 'svcs -H marlin-agent'

    Note: If new servers are added to be storage nodes, the above steps must be run for each new server to install the "marlin" agent. Otherwise compute jobs will not use that server.

  10. In each datacenter's manta deployment zone, deploy Manta components.

    a. In COAL, just run manta-deploy-coal. This step is idempotent.

    b. For a lab machine, just run manta-deploy-lab. This step is idempotent.

    c. For any other installation (including a multi-CN installation), you'll need to run several more steps: assign shards for storage and object metadata with "manta-shardadm"; create a hash ring with "manta-adm create-topology"; generate a "manta-adm" configuration file (see "manta-adm configuration" below); and finally run "manta-adm update config.json" to deploy those zones. Your best bet is to examine the "manta-deploy-dev" script to see how it uses these tools. See "manta-adm configuration" below for details on the input file to "manta-adm update". Each of these steps is idempotent, but the shard and hash ring must be set up before deploying any zones.
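
    As a rough sketch of what those steps look like (the shard names are illustrative and the exact flags are assumptions; consult the "manta-deploy-dev" script and the manta-shardadm/manta-adm manual pages for the authoritative sequence):

     manta$ manta-shardadm set -i "1.moray 2.moray"   # index (object metadata) shards
     manta$ manta-shardadm set -s 1.moray             # storage capacity shard
     manta$ manta-shardadm set -m 1.moray             # marlin (jobs) shard
     manta$ manta-adm create-topology                 # create the metadata hash ring
     manta$ manta-adm update config.json              # deploy zones per your config file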

  11. If desired, set up connectivity to the "ops", "marlin-dashboard", and "madtom" zones. See "Overview of Operating Manta" below for details.

  12. For multi-datacenter deployments, set the MUSKIE_MULTI_DC SAPI property. This is required to enforce that object writes are distributed to multiple datacenters. In the SAPI master datacenter:

    headnode $ app_uuid="$(sdc-sapi /applications?name=manta | json -Ha uuid)"
    headnode $ echo '{ "metadata": { "MUSKIE_MULTI_DC": true } }' | \
        sapiadm update "$app_uuid"
    

    Repeat the following in each datacenter.

    headnode $ manta-oneach -s webapi 'svcadm restart "*muskie*"'
    
  13. If you wish to enable basic monitoring, run the following in each datacenter:

    manta-adm alarm config update
    

    to deploy Amon probes and probe groups shipped with Manta. This will cause alarms to be opened when parts of Manta are not functioning. Email notifications are enabled by default using the address provided to manta-init above. (Email notifications only work after you have configured the Amon service for sending email.) If you want to be notified about alarm events via XMPP, see "Changing alarm contact methods" below.
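
    After the probes are deployed, you can check for open alarms at any time (a hedged example, assuming the alarm subcommands shipped with your version of manta-adm):

    manta-adm alarm show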

  14. In development environments with more than one storage zone on a single system, it may be useful to apply quotas to storage zones so that if the system fills up, there remains space in the storage pool to address the problem. You can do this by finding the total size of the storage pool using zfs list zones in the global zone:

    # zfs list zones
    NAME    USED  AVAIL  REFER  MOUNTPOINT
    zones  77.5G   395G   612K  /zones
    

    Determine how much you want to allow the storage zones to use. In this case, we'll allow the zones to use 100 GiB each, making up 300 GiB, or 75% of the available storage. Now, find the storage zones:

    # manta-adm show storage
    SERVICE          SH ZONENAME                             GZ ADMIN IP
    storage           1 15711409-ca77-4204-b733-1058f14997c1 172.25.10.4
    storage           1 275dd052-5592-45aa-a371-5cd749dba3b1 172.25.10.4
    storage           1 b6d2c71f-ec3d-497f-8b0e-25f970cb2078 172.25.10.4
    

    and for each one, update the quota using vmadm update. You can apply a 100 GiB quota to all of the storage zones on a single-system Manta using:

    manta-adm show -H -o zonename storage | while read zonename; do
        vmadm update $zonename quota=100; done
    

    Note: This only prevents Manta storage zones from using more disk space than you've budgeted for them. If the rest of the system uses more than expected, you could still run out of disk space. To avoid this, make sure that all zones have quotas and the sum of quotas does not exceed the space available on the system.

    Background: Manta operators are responsible for basic monitoring of components, including monitoring disk usage to avoid components running out of disk space. Out of the box, Manta stops using storage zones that are nearly full. This mechanism relies on storage zones reporting how full they are, which they determine by dividing used space by available space. However, Manta storage zones are typically deployed without quotas, which means their available space is limited by the total space in the ZFS storage pool. This accounting is not correct when there are multiple storage zones on the same system.

    To make this concrete, consider a system with 400 GiB of total ZFS pool space. Suppose there are three storage zones, each using 100 GiB of space, and suppose that the rest of the system uses negligible storage space. In this case, there are 300 GiB in use overall, so there's 100 GiB available in the pool. As a result, each zone reports that it's using 100 GiB and has 100 GiB available, so it's 50% full. In reality, though, the system is 75% full. Each zone reports that it has 100 GiB free, but if we were to write just 33 GiB to each zone, the whole system would be full.

    This problem only affects deployments that place multiple storage zones on the same system, which is not typical outside of development. In development, the problem can be worked around by applying appropriate quotas in each zone (as described above).

Post-Deployment Steps

Once the above steps have been completed, there are a few steps you should consider doing to ensure a working deployment.

Prerequisites

If you haven't already done so, you will need to install the Manta CLI tools.

Set up a Manta Account

To test Manta with the Manta CLI tools, you will need an account configured in Triton. You can either use one of the default configured accounts or set up your own. The most common method is to test using the poseidon account which is created by the Manta install.

In either case, you will need access to the Operations Portal. See the instructions here on how to find the IP address of the Operations Portal from your headnode.

Log into the Operations Portal:

  • COAL users should use login admin and the password you initially set up.
  • Lab users will also use admin, but need to ask whoever provisioned your lab account for the password.

Once in, follow these instructions to add ssh keys to the account of your choice.

Test Manta from the CLI Tools

Once you have set up an account on Manta or added your ssh keys to an existing account, you can test your Manta install with the Manta CLI tools you installed above in "Prerequisites".

There are complete instructions on how to get started with the CLI tools on the apidocs page.

Some things in that guide will not be as clear for users of custom deployments.

  1. The biggest difference will be the setting of the MANTA_URL variable. You will need to find the IP address of your API endpoint. To do this from your headnode:

     headnode$ manta-adm show -H -o primary_ip loadbalancer
    

    Multiple addresses will be returned. Choose any one and set MANTA_URL to https://$that_ip.

  2. MANTA_USER will be the account you setup in "Set up a Manta Account" section.

  3. MANTA_KEY_ID will be the ssh key id you added in "Set up a Manta Account" section.

  4. If your Manta deployment does not use a TLS certificate signed by a recognized authority, you might see Error: self signed certificate errors. To fix this, add MANTA_TLS_INSECURE=true to your environment or shell config.

A final ~/.bashrc or ~/.bash_profile might look something like:

export MANTA_USER=poseidon
export MANTA_URL=https://<your-loadbalancer-ip>
export MANTA_TLS_INSECURE=true
export MANTA_KEY_ID=`ssh-keygen -l -f ~/.ssh/id_rsa.pub | awk '{print $2}' | tr -d '\n'`

Lastly test the CLI tools from your development machine:

$ echo "Hello, Manta" > /tmp/hello.txt
$ mput -f /tmp/hello.txt ~~/stor/hello-foo
.../stor/hello-foo          [=======================================================>] 100%      13B
$ mls ~~/stor/
hello-foo
$ mget ~~/stor/hello-foo
Hello, Manta

Deploy Accelerated Garbage Collection Components

While the manta-init deployment script creates the SAPI garbage-collector service, none of the default deployment layouts include any garbage-collector instances. The accelerated gc functionality is intended to be an extension to Manta, so it is deployed and managed with a separate manta-adm mode after the core Manta components are deployed.

Verify objects have no snaplinks

Accelerated garbage-collection can only be used on objects that have no snaplinks. Part of the deployment sequence involves disabling snaplinks for a given user that owns the objects to be deleted. In order to prevent dangling metadata references, it is important to first verify that the user does not own any snaplinked objects. Manta does not have tools to run this check, but one method for doing it might be to scan the Manta table on each shard primary to generate a mapping from objectid to _key and search for collisions.
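
One hedged sketch of such a check, assuming psql access to the "moray" database in each shard's primary postgres zone and that "objectid", "_key", and "type" are columns of the "manta" table (connection details are environment-specific):

# In each shard's primary postgres zone, dump the objectid-to-key mapping:
psql -A -t moray -c "SELECT objectid, _key FROM manta WHERE type = 'object'" > /var/tmp/shard-1.txt

# After copying the per-shard dumps to one place, print any object id that is
# referenced by more than one key (i.e., a snaplinked object):
cut -d'|' -f1 /var/tmp/shard-*.txt | sort | uniq -d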

Get the latest garbage-collector zone image

headnode$ updates-imgadm import name=manta-garbage-collector --latest

Enable Accelerated Garbage-collection (Disable snaplinks)

In order to leverage accelerated garbage-collection, there must be at least one snaplink-disabled user in the Manta deployment. Snaplinks can be disabled using the following command:

headnode$ manta-adm accel-gc enable <ACCOUNT-LOGIN>

To list the accounts for which accelerated garbage-collection is enabled run:

headnode$ manta-adm accel-gc accounts

To disable accelerated garbage-collection (re-enabling snaplinks) for an account:

headnode$ manta-adm accel-gc disable <ACCOUNT-LOGIN>

Since the snaplink-disabled bit for each account is stored as SAPI application metadata, enabling/disabling snaplinks requires restarting all Muskies in the Manta deployment:

headnode$ manta-oneach -s webapi \
    "svcadm restart svc:/manta/application/muskie:muskie-*"

It is also necessary to restart garbage-collectors for the same reason.

headnode$ manta-oneach -s garbage-collector \
    "svcadm restart garbage-collector"

After the restarts are complete, any requests to snaplink objects owned by <ACCOUNT-LOGIN> will fail with a 403 error, and all successful requests to delete objects created and/or owned by the snaplink-disabled account will result in new records added to the manta_fastdelete_queue (and not the manta_delete_log).

To summarize, objects owned by a snaplink-disabled account will not be garbage-collected until:

  • Shards storing those objects are assigned to some garbage-collector
  • The account has been marked snaplink-disabled, and that bit of configuration is picked up by all the Muskies and Garbage-collectors in the deployment.

A simple check that the accelerated GC pipeline is working can be done by logging into an index shard and continuously monitoring the size of the manta_fastdelete_queue while issuing requests to delete objects owned by the snaplink-disabled account. Depending on the batch size configuration of the garbage-collectors in the local Manta deployment, it could take roughly 30 seconds for records inserted into the queue to be removed, but the queue should always eventually be cleared.
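
For example, from a shard's primary postgres zone (a hedged sketch; connection details are environment-specific):

psql moray -c 'SELECT count(*) FROM manta_fastdelete_queue;'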

For more sophisticated progress monitoring, see the node-artedi probes described in the garbage-collector readme.

Deploy garbage-collector zones

garbage-collector instances are deployed just like any other Manta service: by including them in the service layout configuration and then passing the config to manta-adm update. The manta-adm accel-gc mode includes its own genconfig-like functionality that layers any number of garbage-collector instances onto the current deployment configuration. To generate a configuration with N garbage-collectors, run:

headnode$ IMAGEUUID=$(sdc-imgadm list -H -o uuid name=manta-garbage-collector --latest)
headnode$ manta-adm accel-gc genconfig $IMAGEUUID N &> config-with-gc.json

manta-adm accel-gc genconfig functions similarly to manta-adm genconfig, which prints a json-formatted deployment layout. Like manta-adm genconfig, the 'gc' variant also enforces a set of criteria for choosing which compute nodes to add garbage-collector instances to:

  • manta-adm accel-gc will avoid co-locating garbage-collector zones with either loadbalancer zones or nameservice zones. For delete-heavy workloads, garbage-collector instances will generally use more CPU and memory resources. Separating these processes from system-critical services is a precaution against catastrophic service degradation.
  • manta-adm accel-gc will evenly distribute the garbage-collector zones among the CNs with least number of deployed instances that also meet the first criterion

If the above constraints cannot be met in a deployment (this will likely be the case for most non-production deployments), the operator may choose to ignore them with the --ignore-criteria or -i switch:

headnode$ manta-adm accel-gc genconfig -i $IMAGEUUID N > config-with-gc.json

Once you are happy with the configuration, run:

headnode$ manta-adm update config-with-gc.json

to deploy the new zones.

Verify garbage-collectors are running

Each garbage-collector instance runs an administration server that can be used to extract information about its configuration. A good check to run to ensure that the previous step went well is:

headnode$ manta-oneach -s garbage-collector \
    "curl -X GET localhost:2020/shards"

Here is an example of what the output of the previous command might look like in a deployment with two garbage-collectors:

SERVICE          ZONE     OUTPUT
garbage-collector 2bd87578 []
garbage-collector 756e17e4 []

The output indicates that the collectors are healthy, but they haven't been tasked with processing records from any shards yet.

Assign shards to collectors

By default, garbage-collectors come up idle because they are not configured to read records from any metadata shards. The configuration that indicates which collector should process records from which shard is stored in each collector's SAPI instance object. The benefit of storing this mapping in each instance object is that the configuration survives reprovisions and can be managed with the existing SAPI interface.

Each garbage-collector instance object stores a list of index shard identifiers in a metadata array called GC_ASSIGNED_SHARDS. This array can be modified by hand with tools like sapiadm and sdc-sapi, but it is simpler to manage this configuration with manta-adm accel-gc. Once the collectors have been deployed, run:

headnode$ manta-adm accel-gc gen-shard-assignment > gc-shard-assignment.json

to generate a mapping from garbage-collector instances to index shards. The operation will distribute all of the index shards in the Manta deployment to the garbage-collectors as evenly as possible. An example of how such a configuration would look on a deployment with two index shards is:

{
    "756e17e4-bfd2-4bc5-b3b6-166480dd0614": {
        "GC_ASSIGNED_SHARDS": [
            {
                "host": "2.moray.orbit.example.com",
                "last": true
            }
        ]
    },
    "2bd87578-285f-4937-aff1-181a7f32dcc0": {
        "GC_ASSIGNED_SHARDS": [
            {
                "host": "1.moray.orbit.example.com",
                "last": true
            }
        ]
    }
}

Note that data in the GC_ASSIGNED_SHARDS array comes from the SAPI Manta application's INDEX_MORAY_SHARDS metadata field.

The keys of the object are uuids of garbage-collector instances, and the values are changes that will be applied to the metadata blob of the corresponding instance's SAPI object. The last object in the GC_ASSIGNED_SHARDS array must have a "last" field set to true. This is a convention used by Manta's templating engine.
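
If you want to inspect what a given collector currently has assigned (for example, before applying a new mapping), one option is to query SAPI directly from the headnode. A sketch, where <GC-INSTANCE-UUID> is the uuid of a garbage-collector instance:

headnode$ sdc-sapi /instances/<GC-INSTANCE-UUID> | \
    json -H metadata.GC_ASSIGNED_SHARDS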

Update garbage-collector metadata

Once you've generated a mapping from collectors to shards and saved it in gc-shard-assignment.json, run:

headnode$ manta-adm accel-gc update gc-shard-assignment.json

to update the corresponding SAPI metadata. There are a few points to stress here:

  • This operation replaces any existing mapping from garbage-collectors to shards. This behavior makes the assignment idempotent, which mimics the semantics of manta-adm update.
  • Running this operation does not immediately cause the garbage-collectors to start reading records from the newly assigned shards. That requires restarting all the modified collectors.

If new shards or new garbage-collectors are added to the Manta deployment, you may run manta-adm accel-gc gen-shard-assignment followed by manta-adm accel-gc update to re-distribute the full set of index shards across the current set of collectors, as shown below.
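
That is, the re-assignment is just the same two commands run again, followed by the restart described in the next section:

headnode$ manta-adm accel-gc gen-shard-assignment > gc-shard-assignment.json
headnode$ manta-adm accel-gc update gc-shard-assignment.json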

Restart garbage-collectors

In order for the configuration change in the previous step to take effect, you will need to restart the garbage-collectors. This can be done with the following manta-oneach query:

headnode$ manta-oneach -s garbage-collector \
    "svcadm restart config-agent; sleep 5; svcadm restart garbage-collector"

Restarting the config-agent ensures that the changes have been picked up and that the local process configuration file has been re-rendered. After the garbage-collectors have been restarted, you can verify the shard assignment took effect with:

headnode$ manta-oneach -s garbage-collector "curl -X GET localhost:2020/shards"

This should now show output like the following:

SERVICE          ZONE     OUTPUT
garbage-collector 2bd87578 [{"host":"1.moray.orbit.example.com"}]
garbage-collector 756e17e4 [{"host":"2.moray.orbit.example.com"}]

Networking configuration

The networking configuration file is a per-datacenter JSON file with several properties:

| Property | Kind | Description |
| --- | --- | --- |
| azs | array of strings | list of all availability zones (datacenters) participating in Manta in this region |
| this_az | string | string (in azs) denoting this availability zone |
| manta_nodes | array of strings | list of server uuids for all servers participating in Manta in this AZ |
| marlin_nodes | array of strings | list of server uuids (subset of manta_nodes) that are storage nodes |
| admin | object | describes the "admin" network in this datacenter (see below) |
| manta | object | describes the "manta" network in this datacenter (see below) |
| marlin | object | describes the "marlin" network in this datacenter (see below) |
| nic_mappings | object | maps each server in manta_nodes to an object mapping each network name ("manta" and "marlin") to the network interface on the server that should be tagged |
| mac_mappings | object (deprecated) | maps each server uuid from manta_nodes to an object mapping each network name ("admin", "manta", and "marlin") to the MAC address on that server over which that network should be created |
| distribute_svcs | boolean | controls the boot-time networking detection performed by manta-net.sh (see below) |

"admin", "manta", and "marlin" all describe these networks that are built into Manta:

  • admin: the Triton administrative network
  • manta: the Manta administrative network, used for high-volume communication between all Manta services.
  • marlin: the network used for compute zones. This is usually a network that gives out private IPs that are NAT'd to the internet so that users can contact the internet from Marlin jobs, but without needing their own public IP for each zone.

Each of these is an object with several properties:

| Property | Kind | Description |
| --- | --- | --- |
| network | string | Name for the Triton network object (usually the same as the network name) |
| nic_tag | string | NIC tag name for this network (usually the same as the network name) |

Besides those two, each of these blocks has a property for the current availability zone that describes the "subnet", "gateway", "vlan_id", and "start" and "end" provisionable addresses.

nic_mappings is a nested object structure defining the network interface to be tagged for each server defined in manta_nodes, and for each of Manta's required networks. See below for an example of this section of the configuration.

Note: If aggregations are used, they must already exist in NAPI, and updating NIC tags on aggregations will require a reboot of the server in question.

The optional boolean distribute_svcs gives the operator control over the boot-time networking detection that happens each time manta-net.sh is executed (which determines whether the global zone SMF services should be distributed). For example, suppose an operator enabled boot-time networking in a datacenter after installing Manta and subsequently wants to add more storage nodes. For consistency, the operator can set distribute_svcs to true in order to force distribution of these global zone services.
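
The flag is just a top-level key in the per-datacenter configuration file described above; a minimal fragment (the real file also needs the node lists and network definitions shown in the example below):

{
    "this_az": "staging-1",
    "distribute_svcs": true
}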

Note: For global zone network changes handled by boot-time networking to take effect, a reboot of the node must be performed. See Triton's virtual networking documentation for more information on boot-time networking.

For reference, here's an example multi-datacenter configuration with one service node (aac3c402-3047-11e3-b451-002590c57864) and one storage node (445aab6c-3048-11e3-9816-002590c3f3bc):

{
  "this_az": "staging-1",

  "manta_nodes": [
    "aac3c402-3047-11e3-b451-002590c57864",
    "445aab6c-3048-11e3-9816-002590c3f3bc"
  ],
  "marlin_nodes": [
    "445aab6c-3048-11e3-9816-002590c3f3bc"
  ],
  "azs": [
    "staging-1",
    "staging-2",
    "staging-3"
  ],

  "admin": {
    "nic_tag": "admin",
    "network": "admin",
    "staging-1": {
      "subnet": "172.25.3.0/24",
      "gateway": "172.25.3.1"
    },
    "staging-2": {
      "subnet": "172.25.4.0/24",
      "gateway": "172.25.4.1"
    },
    "staging-3": {
      "subnet": "172.25.5.0/24",
      "gateway": "172.25.5.1"
    }
  },

  "manta": {
    "nic_tag": "manta",
    "network": "manta",
    "staging-1": {
      "vlan_id": 3603,
      "subnet": "172.27.3.0/24",
      "start": "172.27.3.4",
      "end": "172.27.3.254",
      "gateway": "172.27.3.1"
    },
    "staging-2": {
      "vlan_id": 3604,
      "subnet": "172.27.4.0/24",
      "start": "172.27.4.4",
      "end": "172.27.4.254",
      "gateway": "172.27.4.1"
    },
    "staging-3": {
      "vlan_id": 3605,
      "subnet": "172.27.5.0/24",
      "start": "172.27.5.4",
      "end": "172.27.5.254",
      "gateway": "172.27.5.1"
    }
  },

  "marlin": {
    "nic_tag": "mantanat",
    "network": "mantanat",
    "staging-1": {
      "vlan_id": 3903,
      "subnet": "172.28.64.0/19",
      "start": "172.28.64.4",
      "end": "172.28.95.254",
      "gateway": "172.28.64.1"
    },
    "staging-2": {
      "vlan_id": 3904,
      "subnet": "172.28.96.0/19",
      "start": "172.28.96.4",
      "end": "172.28.127.254",
      "gateway": "172.28.96.1"
    },
    "staging-3": {
      "vlan_id": 3905,
      "subnet": "172.28.128.0/19",
      "start": "172.28.128.4",
      "end": "172.28.159.254",
      "gateway": "172.28.128.1"
    }
  },

  "nic_mappings": {
    "aac3c402-3047-11e3-b451-002590c57864": {
      "manta": {
        "mac": "90:e2:ba:4b:ec:d1"
      }
    },
    "445aab6c-3048-11e3-9816-002590c3f3bc": {
      "manta": {
        "mac": "90:e2:ba:4a:32:71"
      },
      "mantanat": {
        "aggr": "aggr0"
      }
    }
  }
}

The deprecated mac_mappings can also be used in place of nic_mappings. Only one of nic_mappings or mac_mappings is supported per network configuration file.

In a multi-datacenter configuration, this would be used to configure the "staging-1" datacenter. There would be two more configuration files, one for "staging-2" and one for "staging-3".

Workaround for manta-marlin image import failure

A common failure mode with manta-init ... for those without a fast internet link is a failure to import the large "manta-marlin" image. This is a multi-GB image used for the zones in which Manta compute jobs run. The problem is that the large image can make it easy to hit the one hour timeout for the IMGAPI AdminImportRemoteImage endpoint used to import Manta images. Neither this endpoint nor the https://updates.tritondatacenter.com server hosting the images supports resumable downloads.

Here is a manual workaround (run the following from the headnode global zone):

cd /var/tmp

# Determine the UUID of the latest "manta-marlin" image on updates.tritondatacenter.com.
muuid=$(updates-imgadm list name=manta-marlin --latest -H -o uuid)

# Download directly from a separate manual download area in Manta.
curl -kO https://us-central.manta.mnx.io/Joyent_Dev/public/Manta/manta-marlin-image/$muuid.imgmanifest

# First ensure that the origin (i.e. parent) image is installed
origin=$(json -f $muuid.imgmanifest origin)
[[ -z "$origin" ]] \
    || sdc-imgadm get $origin >/dev/null \
    || sdc-imgadm import $origin -S https://updates.tritondatacenter.com

# If that failed, then the separate download area doesn't have a recent
# image. Please log an issue.
[[ $? -ne 0 ]] && echo log an issue at https://github.com/TritonDataCenter/manta/issues/

# If the following is interrupted, then re-run the same command to resume:
curl -kO -C - https://us-central.manta.mnx.io/Joyent_Dev/public/Manta/manta-marlin-image/$muuid.file.gz

# Verify the download checksum
[[ $(json -f $muuid.imgmanifest | json files.0.sha1) \
    == $(openssl dgst -sha1 $muuid.file.gz | awk '{print $2}') ]] \
    || echo "error downloading, please delete and retry"

# Then install this image into the DC's IMGAPI:
sdc-imgadm import -m $muuid.imgmanifest -f $muuid.file.gz

manta-adm configuration

"manta-adm" is the tool we use both to deploy all of the Manta zones and then to provision new zones, deprovision old zones, or reprovision old zones with a new image. "manta-adm" also has commands for viewing what's deployed, showing information about compute nodes, and more, but this section only discusses the configuration file format.

A manta-adm configuration file takes the form:

{
    "COMPUTE_NODE_UUID": {
        "SERVICE_NAME": {
            "IMAGE_UUID": COUNT_OF_ZONES
        },
        "SHARDED_SERVICE_NAME": {
            "SHARD_NUMBER": {
                "IMAGE_UUID": COUNT_OF_ZONES
            },
        }
    },
}

The file specifies how many of each kind of zone should be deployed on each compute node. For most zones, the "kind" of zone is just the service name (e.g., "storage"). For sharded zones, you also have to specify the shard number.

After you've run manta-init, you can generate a sample configuration for a single-system install using "manta-adm genconfig". Use that to give you an idea of what this looks like:

$ manta-adm genconfig coal
{
    "<any>": {
        "nameservice": {
            "197e905a-d15d-11e3-90e2-6bf8f0ea92b3": 1
        },
        "postgres": {
            "1": {
                "92782f28-d236-11e3-9e6c-5f7613a4df37": 2
            }
        },
        "moray": {
            "1": {
                "ef659002-d15c-11e3-a5f6-4bf577839d16": 1
            }
        },
        "electric-moray": {
            "e1043ddc-ca82-11e3-950a-ff14d493eebf": 1
        },
        "storage": {
            "2306b44a-d15d-11e3-8f87-6b9768efe5ae": 2
        },
        "authcache": {
            "5dff63a4-d15c-11e3-a312-5f3ea4981729": 1
        },
        "webapi": {
            "319afbfa-d15e-11e3-9aa9-33ebf012af8f": 1
        },
        "loadbalancer": {
            "7aac4c88-d15c-11e3-9ea6-dff0b07f5db1": 1
        },
        "jobsupervisor": {
            "7cf43bb2-d16c-11e3-b157-cb0adb200998": 1
        },
        "jobpuller": {
            "1b0f00e4-ca9b-11e3-ba7f-8723c9cd3ce7": 1
        },
        "medusa": {
            "bb6e5424-d0bb-11e3-8527-676eccc2a50a": 1
        },
        "ops": {
            "ab253aae-d15d-11e3-8f58-3fb986ce12b3": 1
        },
        "marlin": {
            "1c18ae6e-cf70-473a-a22c-f3536d6ea789": 2
        }
    }
}

This file effectively specifies all of the Manta components except for the platforms and Marlin agents.

You can generate a configuration file that describes your current deployment with manta-adm show -s -j.

For a coal or lab deployment, your best bet is to save the output of manta-adm genconfig coal or manta-adm genconfig lab to a file and use that. This is what the manta-deploy-coal and manta-deploy-lab scripts do, and you may as well just use those.

Once you have a file like this, you can pass it to manta-adm update, which will show you what it will do in order to make the deployment match the configuration file, and then it will go ahead and do it. For more information, see "manta-adm help update".
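
Putting that together, a typical reconfiguration pass looks something like the following sketch (the edit step is whatever change you actually need to make):

$ manta-adm show -s -j > config.json
$ vi config.json            # adjust zone counts or image uuids as needed
$ manta-adm update config.json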

Upgrading Manta components

Manta services upgrades

There are two distinct methods of updating instances: you may deploy additional instances, or you may reprovision existing instances.

With the first method (new instances), additional instances are provisioned using a newer image. This approach allows you to add additional capacity without disrupting extant instances, and may prove useful when an operator needs to validate a new version of a service before adding it to the fleet.

With the second method (reprovision), an existing instance is swapped onto a newer image, while preserving any data in the instance's delegated dataset. Any data or customizations in the instance's main dataset (i.e., zones/UUID) will be lost. Services which have persistent state (manatee, mako, redis) must use this method to avoid discarding their data. This update takes the service offline for 15-30 seconds. If the image onto which an instance is reprovisioned doesn't work, the instance can be reprovisioned back to its original image.

This procedure uses "manta-adm" to do the upgrade, which uses the reprovisioning method for all zones other than the "marlin" zones.

Prerequisites

  1. Figure out which image you want to install. You can list available images by running updates-imgadm:

     headnode$ updates-imgadm list name=manta-jobsupervisor | tail
    

    Replace manta-jobsupervisor with some other image name, or leave it off to see all images. Typically you'll want the most recent one. Note the uuid of the image in the first column.

  2. Figure out which zones you want to reprovision. In the headnode GZ of a given datacenter, you can enumerate the zones and versions for a given manta_role using:

     headnode$ manta-adm show jobsupervisor
    

    You'll want to note the VM UUIDs for the instances you want to update.

Procedure

Run this in each datacenter:

  1. Download updated images. The supported approach is to re-run, inside the manta-deployment zone, the same manta-init command that you used when initially deploying Manta. For us-central, use:

     $ manta-init -e manta+us-central@tritondatacenter.com -s production -c 10
    

    Do not run manta-init concurrently in multiple datacenters.

  2. Inside the Manta deployment zone, generate a configuration file representing the current deployment state:

     $ manta-adm show -s -j > config.json
    
  3. Modify the file as desired. See "manta-adm configuration" above for details on the format. In most cases, you just need to change the image uuid for the service that you're updating (one way to do that is sketched after this procedure). You can get the latest image for a service with the following command:

     $ sdc-sapi "/services?name=[service name]&include_master=true" | \
         json -Ha params.image_uuid
    
  4. Pass the updated file to manta-adm update:

     $ manta-adm update config.json
    

    Do not run manta-adm update concurrently in multiple datacenters.

  5. Update the alarm configuration as needed. See "Amon Alarm Updates" below for details.
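
As a concrete example of the edit in step 3: if the only change is swapping one image uuid for another, one possible approach (a sketch, assuming the old uuid appears in config.json only for the service you intend to update) is:

$ old_image=<CURRENT-IMAGE-UUID>
$ new_image=<NEW-IMAGE-UUID>
$ sed -e "s/$old_image/$new_image/g" config.json > config-updated.json
$ manta-adm update config-updated.json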

Marlin agent upgrades

Run this procedure for each datacenter whose Marlin agents you want to upgrade.

  1. Find the build you want to use using updates-imgadm in the global-zone of the headnode.

     headnode$ updates-imgadm list name=marlin
    
  2. Fetch the desired tarball to /var/tmp on the headnode. The file will be named <UUID>-file.gz.

     headnode$ uuid=<UUID>
     headnode$ cd /var/tmp
     headnode$ updates-imgadm get-file -O "$uuid"
    
  3. Copy the tarball to each of the storage nodes:

     headnode$ manta-oneach -G -s storage \
         -d /var/tmp -g "/var/tmp/${uuid}-file.gz"
    
  4. Apply the update to all shrimps with:

     headnode$ manta-oneach -G -s storage \
         "/opt/smartdc/agents/bin/apm install /var/tmp/${uuid}-file.gz"
    
  5. Verify that agents are online with:

     headnode$ manta-oneach -G -s storage 'svcs marlin-agent'
    
  6. Make sure most of the compute zones become ready. For this, you can use the dashboard or run this periodically:

     headnode$ manta-oneach -G -s storage mrzones
    

Note: The "apm install" operation will restart the marlin-agent service, which has the impact of aborting any currently-running tasks on that system and causi