Manta (v1) is an internet-facing object store with in-situ Unix-based compute as a first-class operation. The user interface to Manta is essentially:
- A filesystem-like namespace, with directories and objects, accessible over HTTP
- Objects are arbitrary-size blobs of data
- Users can use standard HTTP `PUT`/`GET`/`DELETE` verbs to create and remove directories and objects as well as to list directories
- Users can fetch arbitrary ranges of an object, but may not modify an object except by replacing it
- Users submit map-reduce compute jobs that run arbitrary Unix programs on their objects.
Users can interact with Manta through the official Node.js CLI; the Triton user portal; the Node, Python, Ruby, or Java SDKs; curl(1); or any web browser.
For more information, see the official public user documentation. Before reading this document, you should be very familiar with using Manta, including both the CLI tools and the compute jobs features. You should also be comfortable with all the reference material on how the system works from a user's perspective.
(Note: This is the operator guide for Manta v1. See this document for information on mantav1 vs mantav2. If you are using mantav2, please see the Mantav2 Operator Guide.)
- Architecture basics
- Design constraints
- Basic terminology
- Manta and Triton (SDC)
- Components of Manta
- Services, instances, and agents
- Consensus and internal service discovery
- External service discovery
- Storage tier
- Metadata tier
- The front door
- Compute tier (a.k.a., Marlin)
- Multipart uploads
- Garbage collection, auditing, and metering
- Accelerated Garbage Collection
- Manta Scalability
- Planning a Manta deployment
- Deploying Manta
- Upgrading Manta components
- Overview of Operating Manta
- Debugging: general tasks
- Debugging Marlin: distributed state
- List running jobs
- List recently completed jobs
- Fetch details about a job
- List job inputs, outputs, retries, and errors (as a user would see them)
- Fetch summary of errors for a job
- List tasks not-yet-completed for a given job (and see where they're running)
- See the history of a given task
- Using the Moray tools to fetch detailed state
- Figuring out which jobsupervisor is managing a job
- Troubleshooting Accelerated Garbage Collection
- Debugging Marlin: storage nodes
- Debugging Marlin: anticipated frequent issues
- Users want more information about job progress
- Users observe non-zero count of errors from "mjob get"
- Users observe non-zero count of retries from "mjob get"
- User observes job in "queued" state
- Job hung (not making forward progress)
- Poor job performance
- Error: User Task Error (for unknown reason)
- Error: Task Init Error
- Error: Internal Error
- Zones not resetting
- Debugging Marlin: Zones
- Controlling Marlin
- Advanced deployment notes
- Advanced Marlin Reference
Horizontal scalability. It must be possible to add more hardware to scale any component within Manta without downtime. As a result of this constraint, there are multiple instances of each service.
Strong consistency. In the face of network partitions where it's not possible to remain both consistent and available, Manta chooses consistency. So if all three datacenters in a three-DC deployment become partitioned from one another, requests may fail rather than serve potentially incorrect data.
High availability. Manta must survive failure of any service, physical server, rack, or even an entire datacenter, assuming it's been deployed appropriately. Development installs of Manta can fit on a single system, and obviously those don't survive server failure, but several production deployments span three datacenters and survive partitioning or failure of an entire datacenter without downtime for the other two.
We use nodes to refer to physical servers. Compute nodes mean the same thing they mean in Triton, which is any physical server that's not a head node. Storage nodes are compute nodes that are designated to store actual Manta objects. These are the same servers that run users' compute jobs, but we don't call those compute nodes because that would be confusing with the Triton terminology.
A Manta install uses:
- a headnode (see "Manta and Triton" below)
- one or more storage nodes to store user objects and run compute jobs
- one or more non-storage compute nodes for the other Manta services.
We use the term datacenter (or DC) to refer to an availability zone (or AZ). Each datacenter represents a single Triton deployment (see below). Manta supports being deployed in either 1 or 3 datacenters within a single region, which is a group of datacenters having a high-bandwidth, low-latency network connection.
Manta is built atop Triton (formerly known as SmartDataCenter). A three-datacenter deployment of Manta is built atop three separate Triton deployments. The presence of Manta does not change the way Triton is deployed or operated. Administrators still have AdminUI, APIs, and they're still responsible for managing the Triton services, platform versions, and the like through the normal Triton mechanisms.
All user-facing Manta functionality can be divided into a few major subsystems:
- The storage tier is responsible for storing the physical copies of user objects on disk. Storage nodes store objects as files with random uuids. So within each storage node, the objects themselves are effectively just large, opaque blobs of data.
- The metadata tier is responsible for storing metadata about each object that's visible from the public Manta API. This metadata includes the set of storage nodes on which the physical copy is stored.
- The jobs subsystem (also called Marlin) is responsible for executing user programs on the objects stored in the storage tier.
In order to make all this work, there are several other pieces:
- The front door is made up of the SSL terminators, load balancers, and API servers that actually handle user HTTP requests. All user interaction with Manta happens over HTTP (even compute jobs), so the front door handles all user-facing operations.
- An authentication cache maintains a read-only copy of the UFDS account database. All front door requests are authenticated against this cache.
- A garbage collection and auditing system periodically compares the contents of the metadata tier with the contents of the storage tier to identify deleted objects, remove them, and verify that all other objects are replicated as they should be.
- An accelerated garbage collection system that leverages some additional constraints on object uploads to allow for faster space reclamation.
- A metering system periodically processes log files generated by the rest of the system to produce reports that are ultimately turned into invoices.
- A couple of dashboards provide visibility into what the system is doing at any given point.
- A consensus layer is used to keep track of primary-secondary relationships in the metadata tier.
- DNS-based nameservices are used to keep track of all instances of all services in the system.
Just like with Triton, components are divided into services, instances, and agents. Services and instances are SAPI concepts.
A service is a group of instances of the same kind. For example, "jobsupervisor" is a service, and there may be multiple jobsupervisor zones. Each zone is an instance of the "jobsupervisor" service. The vast majority of Manta components are service instances, and there are several different services involved.
Agents are components that run in the global zone. Manta uses one agent on each storage node called the marlin agent in order to manage the execution of user compute jobs on each storage node.
Note: Do not confuse SAPI services with SMF services. We're talking about SAPI services here. A given SAPI instance (which is a zone) may have many SMF services.
Kind | Major subsystem | Service | Purpose | Components |
---|---|---|---|---|
Service | Consensus | nameservice | Service discovery | ZooKeeper, binder (DNS) |
Service | Front door | loadbalancer | SSL termination and load balancing | stud, haproxy/muppet |
Service | Front door | webapi | Manta HTTP API server | muskie |
Service | Front door | authcache | Authentication cache | mahi (redis) |
Service | Metadata | postgres | Metadata storage and replication | postgres, manatee |
Service | Metadata | moray | Key-value store | moray |
Service | Metadata | electric-moray | Consistent hashing (sharding) | electric-moray |
Service | Storage | storage | Object storage and capacity reporting | mako (nginx), minnow |
Service | Operations | ops | GC, audit, and metering cron jobs | mola, mackerel |
Service | Operations | madtom | Web-based Manta monitoring | madtom |
Service | Operations | marlin-dashboard | Web-based Marlin monitoring | marlin-dashboard |
Service | Compute | jobsupervisor | Distributed job orchestration | jobsupervisor |
Service | Compute | jobpuller | Job archival | wrasse |
Service | Compute | marlin | Compute containers for end users | marlin-lackey |
Agent | Compute | marlin-agent | Job execution on each storage node | marlin-agent |
Agent | Compute | medusa | Interactive Session Engine | medusa |
In some sense, the heart of Manta (and Triton) is a service discovery mechanism (based on ZooKeeper) for keeping track of which service instances are running. In a nutshell, this works like this:
- Setup: There are 3-5 "nameservice" zones deployed that form a ZooKeeper cluster. There's a "binder" DNS server in each of these zones that serves DNS requests based on the contents of the ZooKeeper data store.
- Setup: When other zones are deployed, part of their configuration includes the IP addresses of the nameservice zones. These DNS servers are the only components that each zone knows about directly.
- When an instance starts up (e.g., a "moray" zone), an SMF service called the registrar connects to the ZooKeeper cluster (using the IP addresses configured with the zone) and publishes its own IP to ZooKeeper. A moray zone for shard 1 in region "us-central" publishes its own IP under "1.moray.us-central.mnx.io".
- When a client wants to contact the shard 1 moray, it makes a DNS request for 1.moray.us-central.mnx.io using the DNS servers in the ZooKeeper cluster. Those DNS servers return all IPs that have been published for 1.moray.us-central.mnx.io.
- If the registrar in the 1.moray zone dies, the corresponding entry in the ZooKeeper data store is automatically removed, causing that zone to fall out of DNS. Due to DNS TTLs of 60s, it may take up to a minute for clients to notice that the zone is gone.
Internally, most services work this way.
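To see this mechanism in action, you can query a binder instance directly with dig(1). This is a minimal sketch using the shard 1 moray name from the example above; the nameservice IP shown is a placeholder, and both it and the DNS suffix will differ in your deployment:

    # Ask one of the binder/nameservice instances (placeholder IP) which IPs
    # are currently registered for shard 1's moray service.
    dig +short @10.0.0.10 1.moray.us-central.mnx.io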
Since we don't control Manta clients, the external service discovery system is simpler and more static. We manually configure the public `us-central.manta.mnx.io` DNS name to resolve to each of the loadbalancer public IP addresses. After a request reaches the loadbalancers, the components involved use the internal service discovery mechanism described above to contact whatever other services they need.
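For example, resolving that public name from any client machine should return the set of loadbalancer public IPs; a quick sketch (substitute your own deployment's DNS name):

    # Each address returned is a loadbalancer's public IP.
    dig +short us-central.manta.mnx.io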
The storage tier is made up of Mantis Shrimp nodes. Besides having a great deal of physical storage in order to store users' objects, these systems have lots of DRAM in order to support a large number of marlin compute zones, where we run user programs directly on their data.
Each storage node has an instance of the storage service, also called a "mako" or "shark" (as in: a shard of the storage tier). Inside this zone runs:
- mako: an nginx instance that supports simple PUT/GET for objects. This is not the front door; this is used internally to store each copy of a user object. Objects are stored in a ZFS delegated dataset inside the storage zone, under `/manta/$account_uuid/$object_uuid`.
- minnow: a small Node service that periodically reports storage capacity data into the metadata tier so that the front door knows how much capacity each storage node has.
In addition to the "storage" zone, each storage node has some number of marlin zones (or compute zones). These are essentially blank zones in which we run user programs. We currently configure 128 of these zones on each storage node.
The metadata tier is itself made up of three levels:
- "postgres" zones, which run instances of the postgresql database
- "moray" zones, which run a key-value store on top of postgres
- "electric-moray" zones, which handle sharding of moray requests
All object metadata is stored in PostgreSQL databases. Metadata is keyed on the object's name, and the value is a JSON document describing properties of the object including what storage nodes it's stored on.
This part is particularly complicated, so pay attention! The metadata tier is replicated for availability and sharded for scalability.
It's easiest to think of sharding first. Sharding means dividing the entire namespace into one or more shards in order to scale horizontally. So instead of storing objects A-Z in a single postgres database, we might choose two shards (A-M in shard 1, N-Z in shard 2), or three shards (A-I in shard 1, J-R in shard 2, S-Z in shard 3), and so on. Each shard is completely separate from the others. They don't overlap at all in the data that they store. The shard responsible for a given object is determined by consistent hashing on the directory name of the object. So the shard for "/mark/stor/foo" is determined by hashing "/mark/stor".
Within each shard, we use multiple postgres instances for high availability. At any given time, there's a primary peer (also called the "master"), a secondary peer (also called the "synchronous slave"), and an async peer (sometimes called the "asynchronous slave"). As the names suggest, we configure synchronous replication between the primary and secondary, and asynchronous replication between the secondary and the async peer. Synchronous replication means that transactions must be committed on both the primary and the secondary before the write is acknowledged to the client. Asynchronous replication means that the asynchronous peer may be slightly behind the other two.
The idea with configuring replication in this way is that if the primary crashes, we take several steps to recover:
- The shard is immediately marked read-only.
- The secondary is promoted to the primary.
- The async peer is promoted to the secondary. With the shard being read-only, it should quickly catch up.
- Once the async peer catches up, the shard is marked read-write again.
- When the former primary comes back online, it becomes the asynchronous peer.
This allows us to quickly restore read-write service in the event of a postgres crash or an OS crash on the system hosting the primary. This process is managed by the "manatee" component, which uses ZooKeeper for leader election to determine which postgres will be the primary at any given time.
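One way to see the current replication topology for a shard is to ask manatee itself. A hedged sketch (the exact subcommand varies by manatee version; older builds use `manatee-adm status` instead of `show`):

    # From the headnode, ask each postgres zone for its view of its shard's
    # cluster state (primary, sync, and async peers).
    headnode$ manta-oneach -s postgres 'manatee-adm show'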
It's really important to keep straight the difference between sharding and replication. Even though replication means that we have multiple postgres instances in each shard, only the primary can be used for read/write operations, so we're still limited by the capacity of a single postgres instance. That's why we have multiple shards.
There are actually three kinds of metadata in Manta:
- Object metadata, which is sharded as described above. This may be medium to high volume, depending on load.
- Storage node capacity metadata, which is reported by "minnow" instances (see above) and all lives on one shard. This is extremely low-volume: a couple of writes per storage node per minute.
- Compute job state, which is all stored on a single shard. This can be extremely high volume, depending on job load.
In the us-central production deployment, shard 1 stores compute job state and storage node capacity. Shards 2-4 store the object metadata.
Manta supports resharding object metadata, which would typically be used to add an additional shard (for additional capacity). This operation has never been needed (or used) in production. Assuming the service is successful, that's likely just a matter of time.
For each metadata shard (which we said above consists of three PostgreSQL databases), there are two or more "moray" instances. Moray is a key-value store built on top of postgres. Clients never talk to postgres directly; they always talk to Moray. (Actually, they generally talk to electric-moray, which proxies requests to Moray. See below.) Moray keeps track of the replication topology (which Postgres instance is the primary, which is the secondary, and which is the async) and directs all read/write requests to the primary Postgres instance. This way, clients don't need to know about the replication topology.
Like Postgres, each Moray instance is tied to a particular shard. These are typically referred to as "1.moray", "2.moray", and so on.
The electric-moray service sits in front of the sharded Moray instances and directs requests to the appropriate shard. So if you try to update or fetch the metadata for `/mark/stor/foo`, electric-moray will hash `/mark/stor` to find the right shard and then proxy the request to one of the Moray instances operating that shard.
The front door consists of "loadbalancer" and "webapi" zones.
"loadbalancer" zones actually run both stud (for SSL termination) and haproxy (for load balancing across the available "webapi" instances). "haproxy" is managed by a component called "muppet" that uses the DNS-based service discovery mechanism to keep haproxy's list of backends up-to-date.
"webapi" zones run the Manta-specific API server, called muskie. Muskie handles PUT/GET/DELETE requests to the front door, including requests to:
- create and delete objects
- create, list, and delete directories
- create compute jobs, submit input, end input, fetch inputs, fetch outputs, fetch errors, and cancel jobs
- create multipart uploads, upload parts, fetch multipart upload state, commit multipart uploads, and abort multipart uploads
Requests for objects and directories involve:
- validating the request
- authenticating the user (via mahi, the auth cache)
- looking up the requested object's metadata (via electric moray)
- authorizing the user for access to the specified resource
For requests on directories and zero-byte objects, the last step is to update or return the right metadata.
For write requests on objects, muskie then:
- Constructs a set of candidate storage nodes that will be used to store the object's data, where each storage node is located in a different datacenter (in a multi-DC configuration). By default, there are two copies of the data, but users can configure this by setting the durability level with the request.
- Tries to issue a PUT with 100-continue to each of the storage nodes in the candidate set. If that fails, it tries another set. If all sets are exhausted, the request fails with a 503.
- Once the 100-continue is received from all storage nodes, the user's data is streamed to all nodes. Upon completion, there should be a 204 response from each storage node.
- Once the data is safely written to all nodes, the metadata tier is updated (using a PUT to electric-moray), and a 204 is returned to the client. At this point, the object's data is recorded persistently on the requested number of storage nodes, and the metadata is replicated on at least two index nodes.
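As a concrete example of the durability level mentioned above, a client can ask for more copies when creating an object, and muskie will then select that many storage nodes for the write. A small sketch using the node-manta CLI (the `-c`/`--copies` option sets the number of copies; the path is just an example):

    # Store three copies of this object instead of the default two.
    mput -c 3 -f report.tgz ~~/stor/report.tgz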
For read requests on objects, muskie instead contacts each of the storage nodes hosting the data and streams data from whichever one responds first to the client.
Requests to manipulate compute jobs generally translate into creating or listing job-related Moray records:
- When the user submits a request to create a job, muskie creates a new job record in Moray.
- When the user submits a request to add an input, muskie creates a new job input record in Moray.
- When the user submits a request to cancel a job or end input, muskie modifies the job record in Moray.
- When the user lists inputs, outputs, or errors, muskie lists job input records, task output records, or error records.
All of these requests operate on the shard storing the compute job state. These requests do not go through electric-moray.
There are three core components of Marlin:
- A small fleet of supervisors manages the execution of jobs. (Supervisors used to be called workers, and you may still see that terminology). Supervisors pick up new inputs, locate the servers where the input objects are stored, issue tasks to execute on those servers, monitor the execution of those tasks, and decide when the job is done. Each supervisor can manage many jobs, but each job is managed by only one supervisor at a time.
- Job tasks execute directly on the Manta storage nodes. Each node has an agent (i.e., the "marlin agent") running in the global zone that manages tasks assigned to that node and the zones available for running those tasks.
- Within each compute zone, task execution is managed by a lackey under the control of the agent.
All of Marlin's state is stored in Moray. A few other components are involved in executing jobs:
- Muskie handles all user requests related to jobs: creating jobs, submitting input, and fetching status, outputs, and errors. To create jobs and submit inputs, Muskie creates and updates records in Moray. See above for details.
- Wrasse (the jobpuller) is a separate component that periodically scans for recently completed jobs, archives the saved state into flat objects back in Manta, and then removes job state from Moray. This is critical to keep the database that manages job state from growing forever.
When the user runs a "map" job, Muskie receives client requests to create the job, to add inputs to the job, and to indicate that there will be no more job inputs. Jobsupervisors compete to take newly assigned jobs, and exactly one will win and become responsible for orchestrating the execution of the job. As inputs are added, the supervisor resolves each object's name to the internal uuid that identifies the object and checks whether the user is allowed to access that object. Assuming the user is authorized for that object, the supervisor locates all copies of the object in the fleet, selects one, and issues a task to an agent running on the server where that copy is stored. This process is repeated for each input object, distributing work across the fleet.
The agent on the storage server accepts the task and runs the user's script in an isolated compute zone. It records any outputs emitted as part of executing the script. When the task has finished running, the agent marks it completed. The supervisor commits the completed task, marking its outputs as final job outputs. When there are no more unprocessed inputs and no uncommitted tasks, the supervisor declares the job done.
If a task fails for a retryable reason, it will be retried a few times, preferably on different servers. If it keeps failing, an error is produced.
Multi-phase map jobs are similar except that the outputs of each first-phase map task become inputs to a new second-phase map task, and only the outputs of the second phase become outputs of the job.
Reducers run like mappers, except that a reducer's input is not completely known until the previous phase has completed. Because reducers can read an arbitrary number of inputs, the inputs themselves are dispatched as individual records, and a separate end-of-input must be issued before the reducer can complete.
The agent on each storage server maintains a fixed set of compute zones in which user scripts can be run. When a map task arrives, the agent locates the file representing the input object on the local filesystem, finds a free compute zone, maps the object into the local filesystem, and runs the user's script, redirecting stdin from the input file and stdout to a local file. When the script exits, assuming it succeeds, the output file is saved as an object in the object store, recorded as an output from the task, and the task is marked completed. If there is more work to be done for the same job, the agent may choose to run it in the same compute zone without doing anything to clean up after the first one. When there is no more work to do, or the agent decides to repurpose the compute zone for another job, the compute zone is halted, the filesystem rolled back to its pristine state, and the zone is booted again to run the next task. Since the compute zones themselves are isolated from one another and they are fully rebooted and rolled back between jobs, there is no way for users' jobs to see or interfere with other jobs running in the system.
The system uses a Moray/PostgreSQL shard for all communication. There are buckets for jobs, job inputs, tasks, task inputs (for reduce tasks), task outputs, and errors. Supervisors and agents poll for new records applicable to them. For example, supervisors poll for tasks assigned to them that have been completed but not committed and agents poll for tasks assigned to them that have been dispatched but not accepted.
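For example, you can inspect these buckets by hand with the Moray client tools from a moray zone on the job-state shard. This is a rough sketch only: the bucket name `marlin_tasks_v2` and the indexed `jobid` field are assumptions here, so check the Marlin reference material for the exact names before relying on it:

    # From a moray zone on the job-state shard (e.g. reached via manta-login),
    # list the task records for one job. JOBID is a placeholder.
    moray$ findobjects -h localhost marlin_tasks_v2 '(jobid=JOBID)'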
Multipart uploads provide an alternate way for users to upload Manta objects. The user creates the multipart upload, uploads the object in parts, and exposes the object in Manta by committing the multipart upload. Generally, these operations are implemented using existing Manta constructs:
- Parts are normal Manta objects, with a few key differences. Users cannot use the GET, POST or DELETE HTTP methods on parts. Additionally, all parts are co-located on the same set of storage nodes, which are selected when the multipart upload is created.
- All parts for a given multipart upload are stored in a parts directory, which is a normal Manta directory.
- Part directories are stored in the top-level `/$MANTA_USER/uploads` directory tree.
Most of the logic for multipart uploads is performed by Muskie, but there are some additional features of the system only used for multipart uploads:
- the manta_uploads bucket in Moray stores finalizing records for a given shard. A finalizing record is inserted atomically with the target object record when a multipart upload is committed.
- the mako zones have a custom mako-finalize operation invoked by muskie when a multipart upload is committed. This operation creates the target object from the parts and subsequently deletes the parts from disk. This operation is invoked on all storage nodes that will contain the target object when the multipart upload is committed.
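From the user's perspective, this lifecycle is typically driven with the node-manta `mmpu` command. The following is only an illustrative sketch; the argument order and options are assumptions, so check `mmpu help` before using it:

    # Create the upload (prints an upload ID), upload two parts, then commit.
    # $UPLOAD_ID and the part etags are placeholders for values printed by the
    # earlier commands.
    mmpu create ~~/stor/big-object.tgz
    mmpu upload -f part.0 $UPLOAD_ID 0
    mmpu upload -f part.1 $UPLOAD_ID 1
    mmpu commit $UPLOAD_ID $ETAG0 $ETAG1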
Garbage collection, auditing, and metering all run as cron jobs out of the "ops" zone.
Garbage collection is the process of freeing up storage used for objects which no longer exist. When an object is deleted, muskie records that event in a log and removes the metadata from Moray, but does not actually remove the object from storage servers because there may have been other links to it. The garbage collection job (called "mola") processes these logs, along with dumps of the metadata tier (taken periodically and stored into Manta), and determines which objects can safely be deleted. These delete requests are batched and sent to each storage node, which moves the objects to a "tombstone" area. Objects in the tombstone area are deleted after a fixed interval.
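You can observe the tombstone mechanism on the storage nodes themselves. A sketch, assuming the conventional location under each storage zone's delegated dataset (the exact layout may differ in your deployment):

    # List the dated tombstone directories in every storage zone.
    headnode$ manta-oneach -s storage 'ls /manta/tombstone'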
Multipart upload garbage collection is the process of cleaning up data associated with finalized multipart uploads. When a multipart upload is finalized (committed or aborted), there are several items associated with it that need to be cleaned up, including:
- its upload directory metadata
- part object metadata
- part data on disk (if not removed during the `mako-finalize` operation)
- its finalizing metadata
Similar to the basic garbage collection job, there is a multipart upload garbage collection job that operates on dumps of the metadata tier to determine what data associated with multipart uploads can be safely deleted. A separate operation deletes these items after the job has been completed.
Auditing is the process of ensuring that each object is replicated as expected. This is a similar job run over the contents of the metadata tier and manifests reported by the storage nodes.
Metering is the process of measuring how much resource each user used, both for reporting and billing. There's compute metering (how much compute time was used), storage metering (how much storage is used), request metering, and bandwidth metering. These are compute jobs run over the compute logs (produced by the marlin agent), the metadata dumps, and the muskie request logs.
Accelerated garbage-collection is a mechanism for lower latency storage space
reclamation for objects owned by accounts whose data is guaranteed not to be
snaplinked. Unlike traditional garbage-collection, accelerated garbage-collection
does not rely on the jobs-tier or backups of the metadata tier. However, the
actual object file deletion is still done with the cron-scheduled mako_gc.sh
script.
Manta's offline garbage-collection system relies on a Manta job that periodically
processes database dumps from all shards to locate unreferenced object data and
mark it for cleanup. Objects which are found to only be referenced in the
manta_delete_log
after a system-wide grace-period are marked for deletion.
This grace-period is intended to reduce the probability of deleting a snaplinked
object. The reason the system processes database dumps from all shards is that,
in a traditional setting, a single object may have multiple references on different
metadata nodes.
Accelerated garbage-collection processes only objects that are owned by snaplink-disabled accounts. Such objects are guaranteed to have a single metadata reference, and the delete records generated for them are added to the `manta_fastdelete_queue` as opposed to the `manta_delete_log`. A "garbage-collector" reads records from the queue and marks the corresponding objects for deletion. Since metadata records are removed from the `manta` table and added to the `manta_fastdelete_queue` in the same database transaction, rows in the `manta_fastdelete_queue` refer to objects that are no longer visible to Manta users.
For Manta users whose data has never been snaplinked, the system provides an option to trade Manta's snaplink functionality for lower-latency storage space reclamation. While accelerated GC can be deployed in conjunction with the traditional offline GC system, it will only process objects owned by snaplink-disabled accounts. Objects owned by other accounts will continue to be processed by the traditional garbage-collection pipeline.
There are many dimensions to scalability.
In the metadata tier:
- number of objects (scalable with additional shards)
- number of objects in a directory (fixed, currently at a few million objects)
In the storage tier:
- total size of data (scalable with additional storage servers)
- size of data per object (limited to the amount of storage on any single system, typically in the tens of terabytes, which is far larger than is typically practical)
In terms of performance:
- total bytes in or out per second (depends on network configuration)
- count of concurrent requests (scalable with additional metadata shards or API servers)
- count of compute tasks executed per second (scalable with additional storage nodes)
- count of concurrent compute tasks (could be measured in tasks, CPU cores available, or DRAM availability; scaled with additional storage node hardware)
As described above, for most of these dimensions, Manta can be scaled horizontally by deploying more software instances (often on more hardware). For a few of these, the limits are fixed, but we expect them to be high enough for most purposes. For a few others, the limits are not known, and we've never (or rarely) run into them, but we may need to do additional work when we discover where these limits are.
Before even starting to deploy Manta, you must decide:
- the number of datacenters
- the number of metadata shards
- the number of storage and non-storage compute nodes
- how to lay out the non-storage zones across the fleet
You can deploy Manta across any odd number of datacenters in the same region (i.e., having a reliable low-latency, high-bandwidth network connection among all datacenters). We've only tested one- and three-datacenter configurations. Even-numbered configurations are not supported. See "Other configurations" below for details.
A single-datacenter installation can be made to survive server failure, but obviously cannot survive datacenter failure. The us-central deployment uses three datacenters.
Recall that each metadata shard has the storage and load capacity of a single postgres instance. If you want more capacity than that, you need more shards. Shards can be added later without downtime, but it's a delicate operation. The us-central deployment uses three metadata shards, plus a separate shard for the compute and storage capacity data.
We recommend at least two shards so that the compute and storage capacity information can be fully separated from the remaining shards, which would be used for metadata.
The two classes of node (storage nodes and non-storage nodes) usually have different hardware configurations.
The number of storage nodes needed is a function of the expected data footprint and (secondarily) the desired compute capacity.
The number of non-storage nodes required is a function of the expected load on the metadata tier. Since the point of shards is to distribute load, each shard's postgres instance should be on a separate compute node. So you want at least as many compute nodes as you will have shards. The us-central deployment distributes the other services on those same compute nodes.
For information on the latest recommended production hardware, see Triton Manufacturing Matrix and Triton Manufacturing Bill of Materials.
The us-central deployment uses older versions of the Tenderloin-A for service nodes and Mantis Shrimps for storage nodes.
Since there are so many different Manta components, and they're all deployed redundantly, there are a lot of different pieces to think about. So when setting up a Manta deployment, it's very important to think ahead of time about which components will run where!
The `manta-adm genconfig` tool (when used with the `--from-file` option) can be very helpful in laying out zones for Manta. See the `manta-adm` manual page for details. `manta-adm genconfig --from-file` takes as input a list of physical servers and information about each one. Large deployments that use Device 42 to manage hardware inventory may find the manta-genazconfig tool useful for constructing the input for `manta-adm genconfig`.
The most important production configurations are described below, but for reference, here are the principles to keep in mind:
- Storage zones should only be co-located with marlin zones, and only on storage nodes. Neither makes sense without the other, and we do not recommend combining them with other zones. All other zones should be deployed onto non-storage compute nodes.
- Nameservice: There must be an odd number of "nameservice" zones in order to achieve consensus, and there should be at least three of them to avoid a single point of failure. There must be at least one in each DC to survive any combination of datacenter partitions, and it's recommended that they be balanced across DCs as much as possible.
- For the non-sharded, non-ops-related zones (which is everything except moray, postgres, ops, madtom, marlin-dashboard), there should be at least two of each kind of zone in the entire deployment (for availability), and they should not be in the same datacenter (in order to survive a datacenter loss). For single-datacenter deployments, they should at least be on separate compute nodes.
- Only one madtom and marlin-dashboard zone is considered required. It would be good to provide more than one in separate datacenters (or at least separate compute nodes) for maintaining availability in the face of a datacenter failure.
- There should only be one ops zone. If it's unavailable for any reason, that will only temporarily affect metering, garbage collection, and reports.
- For postgres, there should be at least three instances in each shard. For multi-datacenter configurations, these instances should reside in different datacenters. For single-datacenter configurations, they should be on different compute nodes. (Postgres instances from different shards can be on the same compute node, though for performance reasons it would be better to avoid that.)
- For moray, there should be at least two instances per shard in the entire deployment, and these instances should reside on separate compute nodes (and preferably separate datacenters).
- garbage-collector zones should not be configured to poll more than 6 shards. Further, they should not be co-located with instances of other CPU-intensive Manta components (e.g. loadbalancer) to avoid interference with the data path.
Most of these constraints are required in order to maintain availability in the event of failure of any component, server, or datacenter. Below are some example configurations.
On each storage node, you should deploy one "storage" zone. We recommend deploying 128 "marlin" zones for systems with 256GB of DRAM.
If you have N metadata shards, and assuming you'll be deploying 3 postgres instances in each shard, you'd ideally want to spread these over 3N compute nodes. If you combine instances from multiple shards on the same host, you'll defeat the point of splitting those into shards. If you combine instances from the same shard on the same host, you'll defeat the point of using replication for improved availability.
You should deploy at least two Moray instances for each shard onto separate compute nodes. The remaining services can be spread over the compute nodes in whatever way, as long as you avoid putting two of the same thing onto the same compute node. Here's an example with two shards using six compute nodes:
CN1 | CN2 | CN3 | CN4 | CN5 | CN6 |
---|---|---|---|---|---|
postgres 1 | postgres 1 | postgres 1 | postgres 2 | postgres 2 | postgres 2 |
moray 1 | moray 1 | electric-moray | moray 2 | moray 2 | electric-moray |
jobsupervisor | jobsupervisor | medusa | medusa | authcache | authcache |
nameservice | nameservice | nameservice | webapi | webapi | webapi |
ops | marlin-dash | madtom | loadbalancer | loadbalancer | loadbalancer |
jobpuller | jobpuller |
In this notation, "postgres 1" and "moray 1" refer to an instance of "postgres" or "moray" for shard 1.
All three datacenters should be in the same region, meaning that they share a reliable, low-latency, high-bandwidth network connection.
On each storage node, you should deploy one "storage" zone. We recommend deploying 128 "marlin" zones for systems with 256GB of DRAM.
As with the single-datacenter configuration, you'll want to spread the postgres instances for N shards across 3N compute nodes, but you'll also want to deploy at least one postgres instance in each datacenter. For four shards, we recommend the following in each datacenter:
CN1 | CN2 | CN3 | CN4 |
---|---|---|---|
postgres 1 | postgres 2 | postgres 3 | postgres 4 |
moray 1 | moray 2 | moray 3 | moray 4 |
nameservice | nameservice | electric-moray | authcache |
ops | jobsupervisor | jobsupervisor | webapi |
webapi | jobpuller | loadbalancer | loadbalancer |
marlin-dashboard | madtom |
In this notation, "postgres 1" and "moray 1" refer to an instance of "postgres" or "moray" for shard 1.
For testing purposes, it's fine to deploy all of Manta on a single system. Obviously it won't survive server failure. This is not supported for a production deployment.
It's not supported to run Manta in an even number of datacenters since there would be no way to maintain availability in the face of an even split. More specifically:
- A two-datacenter configuration is possible but cannot survive datacenter failure or partitioning. That's because the metadata tier would require synchronous replication across two datacenters, which cannot be maintained in the face of any datacenter failure or partition. If we relax the synchronous replication constraint, then data would be lost in the event of a datacenter failure, and we'd also have no way to resolve the split-brain problem where both datacenters accept conflicting writes after the partition.
- For even numbers N >= 4, we could theoretically survive datacenter failure, but any N/2 -- N/2 split would be unresolvable. You'd likely be better off dividing the same hardware into N - 1 datacenters.
It's not supported to run Manta across multiple datacenters not in the same region (i.e., not having a reliable, low-latency, high-bandwidth connection between all pairs of datacenters).
Before you get started for anything other than a COAL or lab deployment, be sure to read and fully understand the section on "Planning a Manta deployment" above.
These general instructions should work for anything from COAL to a multi-DC, multi-compute-node deployment. The general process is:
- Set up Triton in each datacenter, including the headnode, all Triton services, and all compute nodes you intend to use. For easier management of hosts, we recommend that the hostname reflect the type of server and, possibly, the intended purpose of the host. For example, we use the "RA" or "RM" prefix for "Richmond-A" hosts and "MS" prefix for "Mantis Shrimp" hosts.
- In the global zone of each Triton headnode, set up a manta deployment zone using:

      sdcadm post-setup common-external-nics  # enable downloading service images
      sdcadm post-setup manta --mantav1
- In each datacenter, generate a Manta networking configuration file.

  a. For COAL, from the GZ, use:

      headnode$ /zones/$(vmadm lookup alias=manta0)/root/opt/smartdc/manta-deployment/networking/gen-coal.sh > /var/tmp/netconfig.json

  b. For those using the internal Engineering lab, run this from the lab.git repo:

      lab.git$ node bin/genmanta.js -r RIG_NAME LAB_NAME

     and copy that to the headnode GZ.

  c. For other deployments, see "Networking configuration" below.
- Once you've got the networking configuration file, configure networks by running this in the global zone of each Triton headnode:

      headnode$ ln -s /zones/$(vmadm lookup alias=manta0)/root/opt/smartdc/manta-deployment/networking /var/tmp/networking
      headnode$ cd /var/tmp/networking
      headnode$ ./manta-net.sh CONFIG_FILE

  This step is idempotent. Note that if you are setting up a multi-DC Manta, ensure that (1) your Triton networks have cross-datacenter connectivity and routing set up and (2) the Triton firewalls allow TCP and UDP traffic cross-datacenter.
- For multi-datacenter deployments, you must link the datacenters within Triton so that the UFDS database is replicated across all three datacenters.

- For multi-datacenter deployments, you must configure SAPI for multi-datacenter support.

- If you'll be deploying a loadbalancer on any compute nodes other than a headnode, then you'll need to create the "external" NIC tag on those CNs. For common single-system configurations (for dev and test systems), you don't usually need to do anything for this step. For multi-CN configurations, you probably will need to do this. See the Triton documentation for how to add a NIC tag to a CN.
- In each datacenter's manta deployment zone, run the following:

      manta$ manta-init -s SIZE -e YOUR_EMAIL

  `manta-init` must not be run concurrently in multiple datacenters. `SIZE` must be one of "coal", "lab", or "production". `YOUR_EMAIL` is used to create an Amon contact for alarm notifications.

  This step runs various initialization steps, including downloading all of the zone images required to deploy Manta. This can take a while the first time you run it, so you may want to run it in a screen session. It's idempotent.
A common failure mode for those without quite fast internet links is a failure to import the "manta-marlin" image. The manta-marlin image is the multi-GB image that is used for zones in which Manta compute jobs run. See the "Workaround for manta-marlin image import failure" section below.
- In each datacenter, run the following from the headnode global zone to install the "marlin" agent on each server. The marlin agent is a part of the Manta compute jobs system.

      # Download the latest marlin agent image.
      channel=$(sdcadm channel get)
      marlin_agent_image=$(updates-imgadm -C $channel list name=marlin --latest -H -o uuid)
      marlin_agent_file=/var/tmp/marlin-agent-$marlin_agent_image.tgz
      if [[ -f $marlin_agent_file ]]; then
          echo "Already have $marlin_agent_file"
      else
          updates-imgadm -C $channel get-file $marlin_agent_image -o $marlin_agent_file
      fi

      # Copy to all other servers.
      sdc-oneachnode -c -X -d /var/tmp -g $marlin_agent_file

      # Install the agent on all servers.
      sdc-oneachnode -a "apm install $marlin_agent_file >>/var/log/marlin-agent-install.log 2>&1 && echo success || echo 'fail (see /var/log/marlin-agent-install.log)'"

      # Sanity check to ensure marlin agent is running on all storage servers.
      sdc-oneachnode -a 'svcs -H marlin-agent'
Note: If new servers are added to be storage nodes, the above steps must be run for each new server to install the "marlin" agent. Otherwise compute jobs will not use that server.
- In each datacenter's manta deployment zone, deploy Manta components.

  a. In COAL, just run `manta-deploy-coal`. This step is idempotent.

  b. For a lab machine, just run `manta-deploy-lab`. This step is idempotent.

  c. For any other installation (including a multi-CN installation), you'll need to run several more steps: assign shards for storage and object metadata with "manta-shardadm"; create a hash ring with "manta-adm create-topology"; generate a "manta-adm" configuration file (see "manta-adm configuration" below); and finally run "manta-adm update config.json" to deploy those zones. Your best bet is to examine the "manta-deploy-dev" script to see how it uses these tools. See "manta-adm configuration" below for details on the input file to "manta-adm update". Each of these steps is idempotent, but the shard and hash ring must be set up before deploying any zones.
- If desired, set up connectivity to the "ops", "marlin-dashboard", and "madtom" zones. See "Overview of Operating Manta" below for details.
- For multi-datacenter deployments, set the MUSKIE_MULTI_DC SAPI property. This is required to enforce that object writes are distributed to multiple datacenters. In the SAPI master datacenter:

      headnode$ app_uuid="$(sdc-sapi /applications?name=manta | json -Ha uuid)"
      headnode$ echo '{ "metadata": { "MUSKIE_MULTI_DC": true } }' | \
          sapiadm update "$app_uuid"

  Repeat the following in each datacenter:

      headnode$ manta-oneach -s webapi 'svcadm restart "*muskie*"'
- If you wish to enable basic monitoring, run the following in each datacenter:

      manta-adm alarm config update

  to deploy Amon probes and probe groups shipped with Manta. This will cause alarms to be opened when parts of Manta are not functioning. Email notifications are enabled by default using the address provided to `manta-init` above. (Email notifications only work after you have configured the Amon service for sending email.) If you want to be notified about alarm events via XMPP, see "Changing alarm contact methods" below.

- In development environments with more than one storage zone on a single system, it may be useful to apply quotas to storage zones so that if the system fills up, there remains space in the storage pool to address the problem. You can do this by finding the total size of the storage pool using `zfs list zones` in the global zone:

      # zfs list zones
      NAME   USED  AVAIL  REFER  MOUNTPOINT
      zones  77.5G   395G   612K  /zones

  Determine how much you want to allow the storage zones to use. In this case, we'll allow the zones to use 100 GiB each, making up 300 GiB, or 75% of the available storage. Now, find the storage zones:

      # manta-adm show storage
      SERVICE          SH ZONENAME                             GZ ADMIN IP
      storage           1 15711409-ca77-4204-b733-1058f14997c1 172.25.10.4
      storage           1 275dd052-5592-45aa-a371-5cd749dba3b1 172.25.10.4
      storage           1 b6d2c71f-ec3d-497f-8b0e-25f970cb2078 172.25.10.4

  and for each one, update the quota using `vmadm update`. You can apply a 100 GiB quota to all of the storage zones on a single-system Manta using:

      manta-adm show -H -o zonename storage | while read zonename; do vmadm update $zonename quota=100; done
Note: This only prevents Manta storage zones from using more disk space than you've budgeted for them. If the rest of the system uses more than expected, you could still run out of disk space. To avoid this, make sure that all zones have quotas and the sum of quotas does not exceed the space available on the system.
Background: Manta operators are responsible for basic monitoring of components, including monitoring disk usage to avoid components running out of disk space. Out of the box, Manta stops using storage zones that are nearly full. This mechanism relies on storage zones reporting how full they are, which they determine by dividing used space by available space. However, Manta storage zones are typically deployed without quotas, which means their available space is limited by the total space in the ZFS storage pool. This accounting is not correct when there are multiple storage zones on the same system.
To make this concrete, consider a system with 400 GiB of total ZFS pool space. Suppose there are three storage zones, each using 100 GiB of space, and suppose that the rest of the system uses negligible storage space. In this case, there are 300 GiB in use overall, so there's 100 GiB available in the pool. As a result, each zone reports that it's using 100 GiB and has 100 GiB available, so it's 50% full. In reality, though, the system is 75% full. Each zone reports that it has 100 GiB free, but if we were to write just 33 GiB to each zone, the whole system would be full.
This problem only affects deployments that place multiple storage zones on the same system, which is not typical outside of development. In development, the problem can be worked around by applying appropriate quotas in each zone (as described above).
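To sanity-check that budget, you can compare the sum of all zone quotas against the pool size. A rough sketch from the global zone; it assumes `vmadm list` can output each zone's `quota` field (in GiB), which you should verify on your platform version:

    # Total quota assigned to all zones (GiB), versus what the pool actually has.
    vmadm list -H -o quota | awk '{ sum += $1 } END { print sum, "GiB of quota assigned" }'
    zfs list -o name,used,avail zones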
Once the above steps have been completed, there are a few steps you should consider doing to ensure a working deployment.
If you haven't already done so, you will need to install the Manta CLI tools.
To test Manta with the Manta CLI tools, you will need an account configured in Triton. You can either use one of the default configured accounts or set up your own. The most common method is to test using the `poseidon` account, which is created by the Manta install.
In either case, you will need access to the Operations Portal. See the instructions here on how to find the IP address of the Operations Portal from your headnode.
Log into the Operations Portal:
- COAL users should use login `admin` and the password you initially set up.
- Lab users will also use `admin`, but need to ask whoever provisioned your lab account for the password.
Once in, follow these instructions to add ssh keys to the account of your choice.
Once you have set up an account on Manta or added your ssh keys to an existing account, you can test your Manta install with the Manta CLI tools you installed above in "Prerequisites".
There are complete instructions on how to get started with the CLI tools on the apidocs page.
Some things in that guide will not be as clear for users of custom deployments.
- The biggest difference will be the setting of the `MANTA_URL` variable. You will need to find the IP address of your API endpoint. To do this from your headnode:

      headnode$ manta-adm show -H -o primary_ip loadbalancer

  Multiple addresses will be returned. Choose any one and set `MANTA_URL` to `https://$that_ip`.

- `MANTA_USER` will be the account you set up in the "Set up a Manta Account" section.

- `MANTA_KEY_ID` will be the ssh key id you added in the "Set up a Manta Account" section.

- If the key you used is in an environment that has not installed a certificate signed by a recognized authority, you might see `Error: self signed certificate` errors. To fix this, add `MANTA_TLS_INSECURE=true` to your environment or shell config.
A final `~/.bashrc` or `~/.bash_profile` might look something like:
    export MANTA_USER=poseidon
    export MANTA_URL=https://<your-loadbalancer-ip>
    export MANTA_TLS_INSECURE=true
    export MANTA_KEY_ID=`ssh-keygen -l -f ~/.ssh/id_rsa.pub | awk '{print $2}' | tr -d '\n'`
Lastly test the CLI tools from your development machine:
$ echo "Hello, Manta" > /tmp/hello.txt
$ mput -f /tmp/hello.txt ~~/stor/hello-foo
.../stor/hello-foo [=======================================================>] 100% 13B
$ mls ~~/stor/
hello-foo
$ mget ~~/stor/hello-foo
Hello, Manta
While the manta-init
deployment script creates the SAPI garbage-collector
service, none of the default deployment layouts include any garbage-collector
instances. The accelerated gc functionality is intended to be an extension to
Manta, so it is deployed and managed with a separate manta-adm
mode after the
core Manta components are deployed.
Accelerated garbage-collection can only be used on objects that have no snaplinks. Part of the deployment sequence involves disabling snaplinks for a given user that owns the objects to be deleted. In order to prevent dangling metadata references, it is important to first verify that the user does not own any snaplinked objects. Manta does not have tools to run this check, but one method for doing it might be to scan the `manta` table on each shard primary to generate a mapping from `objectid` to `_key` and search for collisions.
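A minimal sketch of that check, run against each shard's primary postgres peer. It assumes the moray database is named `moray`, that the bucket's backing table is `manta`, and that `objectid` and `type` are indexed columns; adjust the psql user and database names for your deployment. Any `objectid` that appears under more than one key indicates a snaplink:

    # Run on each shard's primary (or a read-only standby): look for objectids
    # referenced by more than one metadata key.
    postgres-zone$ psql -U postgres moray -c "
        SELECT objectid, count(*) AS refs
          FROM manta
         WHERE type = 'object'
         GROUP BY objectid
        HAVING count(*) > 1;"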
To begin deploying accelerated GC, import the latest manta-garbage-collector image on the headnode:
headnode$ updates-imgadm import name=manta-garbage-collector --latest
In order to leverage accelerated garbage-collection, there must be at least one snaplink-disabled user in the Manta deployment. Snaplinks can be disabled using the following command:
headnode$ manta-adm accel-gc enable <ACCOUNT-LOGIN>
To list the accounts for which accelerated garbage-collection is enabled run:
headnode$ manta-adm accel-gc accounts
To disable accelerated garbage-collection (re-enabling snaplinks) for an account:
headnode$ manta-adm accel-gc disable <ACCOUNT-LOGIN>
Since the snaplink-disabled bit for each account is stored as SAPI application metadata, enabling/disabling snaplinks requires restarting all Muskies in the Manta deployment:
headnode$ manta-oneach -s webapi "svcadm restart svc:/manta/application/muskie:muskie-*"
It is also necessary to restart garbage-collectors for the same reason.
headnode$ manta-oneach -s garbage-collector "svcadm restart garbage-collector"
After the restarts are complete, any requests to snaplink objects owned by `<ACCOUNT-LOGIN>` will fail with a 403 error, and all successful requests to delete objects created and/or owned by the snaplink-disabled account will result in new records added to the `manta_fastdelete_queue` (and not the `manta_delete_log`).
To summarize, objects owned by a snaplink-disabled account will not be garbage-collected until:
- Shards storing those objects are assigned to some garbage-collector
- The account has been marked snaplink-disabled, and that bit of configuration is picked up by all the Muskies and Garbage-collectors in the deployment.
A simple check that the accelerated GC pipeline is working can be done by
logging into an index shard and continuously monitoring the size of the
manta_fastdelete_queue
while issuing requests to delete objects owned by the
snaplink-disabled account. Depending on the batch size configuration of the
garbage-collectors in the local Manta deployment, it could take roughly 30
seconds for records inserted into the queue to be removed, but the queue
should always eventually be cleared.
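A quick way to watch the queue is to count its rows directly on the shard's primary postgres. A sketch, assuming the usual `moray` database name and table layout (adjust the psql user and database for your deployment):

    # Count queued fast-delete records on one shard; repeat to watch it drain.
    postgres-zone$ psql -U postgres moray -c 'SELECT count(*) FROM manta_fastdelete_queue;'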
For more sophisticated progress monitoring, see the node-artedi probes described in the garbage-collector readme.
garbage-collector instances are deployed just like any other Manta service: by including them in the service layout configuration and then passing the config to `manta-adm update`. The `manta-adm accel-gc` mode includes its own genconfig-like functionality that layers any number of garbage-collector instances onto the current deployment configuration. To generate a configuration with N garbage-collectors, run:
headnode$ IMAGEUUID=$(sdc-imgadm list -H -o uuid name=manta-garbage-collector --latest)
headnode$ manta-adm accel-gc genconfig $IMAGEUUID N &> config-with-gc.json
`manta-adm accel-gc genconfig` functions similarly to `manta-adm genconfig`, which prints a json-formatted deployment layout. Like `manta-adm genconfig`, the 'gc' variant also enforces a set of criteria for choosing which compute nodes to add garbage-collector instances to:

- `manta-adm accel-gc` will avoid co-locating garbage-collector zones with either loadbalancer zones or nameservice zones. For delete-heavy workloads, garbage-collector instances will generally use more CPU and memory resources. Separating these processes from system-critical services is a precaution against catastrophic service degradation.
- `manta-adm accel-gc` will evenly distribute the garbage-collector zones among the CNs with the least number of deployed instances that also meet the first criterion.
If the above constraints cannot be met in a deployment (this will likely be the case for most non-production deployments), the operator may choose to ignore them with the --ignore-criteria or -i switch:
headnode$ manta-adm accel-gc genconfig -i $IMAGEUUID N &> config-with-gc.json
Once you are happy with the configuration, run:
headnode$ manta-adm update config-with-gc.json
to deploy the new zones.
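As a quick sanity check (not part of the formal procedure), you can confirm that the new garbage-collector zones were actually created:
headnode$ manta-adm show garbage-collector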
Each garbage-collector instance runs an administration server that can be used to extract information about its configuration. A good check to run to ensure that the previous step went well is:
headnode$ manta-oneach -s garbage-collector \
"curl -X GET localhost:2020/shards"
Here is an example of what the output of the previous command might look like in a deployment with two garbage-collectors:
SERVICE ZONE OUTPUT
garbage-collector 2bd87578 []
garbage-collector 756e17e4 []
The output indicates that the collectors are healthy, but they haven't been tasked with processing records from any shards yet.
By default, garbage-collectors come up idle because they are not configured to read records from any metadata shards. The configuration that indicates which collector should process records from which shard is stored in each collector's SAPI instance object. The benefit of storing this mapping in each instance object is that the configuration survives reprovisions and can be managed with the existing SAPI interface.
Each garbage-collector instance object stores a list of index shard identifiers in a metadata array called GC_ASSIGNED_SHARDS. This array can be modified by hand with tools like sapiadm and sdc-sapi, but it is simpler to manage this configuration with manta-adm accel-gc. Once the collectors have been deployed, run:
headnode$ manta-adm accel-gc gen-shard-assignment &> gc-shard-assignment.json
to generate a mapping from garbage-collector instances to index shards. The operation will distribute all of the index shards in the Manta deployment to the garbage-collectors as evenly as possible. An example of how such a configuration would look on a deployment with two index shards is:
{
"756e17e4-bfd2-4bc5-b3b6-166480dd0614": {
"GC_ASSIGNED_SHARDS": [
{
"host": "2.moray.orbit.example.com",
"last": true
}
]
},
"2bd87578-285f-4937-aff1-181a7f32dcc0": {
"GC_ASSIGNED_SHARDS": [
{
"host": "1.moray.orbit.example.com",
"last": true
}
]
}
}
Note that data in the GC_ASSIGNED_SHARDS array comes from the SAPI Manta application's INDEX_MORAY_SHARDS metadata field. The keys of the object are uuids of garbage-collector instances, and the values are changes that will be applied to the metadata blob of the corresponding instance's SAPI object. The last object in the GC_ASSIGNED_SHARDS array must have a "last" field set to true. This is a convention used by Manta's templating engine.
Once you've generated a mapping from collectors to shards and saved it in gc-shard-assignment.json, run:
headnode$ manta-adm accel-gc update gc-shard-assignment.json
to update the corresponding SAPI metadata. There are a few points to stress here:
- This operation replaces any existing mapping from garbage-collectors to shards. This behavior makes the assignment idempotent, which mimics the semantics of manta-adm update.
- Running this operation does not immediately cause the garbage-collectors to start reading records from the newly assigned shards. That requires restarting all the modified collectors.
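After running the update, you can inspect the resulting assignment for a single collector directly in SAPI. This is a hedged sketch using the instance uuid from the example above; substitute one of your own collector uuids:
headnode$ sdc-sapi /instances/756e17e4-bfd2-4bc5-b3b6-166480dd0614 | \
    json -H metadata.GC_ASSIGNED_SHARDS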
If new shards or new garbage-collectors are added to the Manta deployment, you may run manta-adm accel-gc gen-shard-assignment followed by manta-adm accel-gc update to re-assign the new pool of shards to the new pool of collectors.
In order for the configuration change in the previous step to take effect, you will need to restart the garbage-collectors. This can be done with the following manta-oneach query:
headnode$ manta-oneach -s garbage-collector \
    "svcadm restart config-agent; sleep 5; svcadm restart garbage-collector"
Restarting the config-agent ensures that the changes have been picked up and that the local process configuration file has been re-rendered. After the garbage-collectors have been restarted, you can verify the shard assignment took effect with:
headnode$ manta-oneach -s garbage-collector "curl -X GET localhost:2020/shards"
This should now show the following output:
SERVICE ZONE OUTPUT
garbage-collector 2bd87578 [{"host":"1.moray.orbit.example.com"}]
garbage-collector 756e17e4 [{"host":"2.moray.orbit.example.com"}]
The networking configuration file is a per-datacenter JSON file with several properties:
Property | Kind | Description |
---|---|---|
azs | array of strings | list of all availability zones (datacenters) participating in Manta in this region |
this_az | string | string (in azs) denoting this availability zone |
manta_nodes | array of strings | list of server uuids for all servers participating in Manta in this AZ |
marlin_nodes | array of strings | list of server uuids (subset of manta_nodes) that are storage nodes |
admin | object | describes the "admin" network in this datacenter (see below) |
manta | object | describes the "manta" network in this datacenter (see below) |
marlin | object | describes the "marlin" network in this datacenter (see below) |
nic_mappings | object | maps each server in manta_nodes to an object mapping each network name ("manta" and "marlin") to the network interface on the server that should be tagged |
mac_mappings | object | (deprecated) maps each server uuid from manta_nodes to an object mapping each network name ("admin", "manta", and "marlin") to the MAC address on that server over which that network should be created |
distribute_svcs | boolean | controls the boot-time networking detection performed by manta-net.sh (see below) |
"admin", "manta", and "marlin" all describe these networks that are built into Manta:
admin
: the Triton administrative networkmanta
: the Manta administrative network, used for high-volume communication between all Manta services.marlin
: the network used for compute zones. This is usually a network that gives out private IPs that are NAT'd to the internet so that users can contact the internet from Marlin jobs, but without needing their own public IP for each zone.
Each of these is an object with several properties:
Property | Kind | Description |
---|---|---|
network | string | Name for the Triton network object (usually the same as the network name) |
nic_tag | string | NIC tag name for this network (usually the same as the network name) |
Besides those two, each of these blocks has a property for each availability zone that describes the "subnet", "gateway", "vlan_id", and "start" and "end" provisionable addresses.
nic_mappings is a nested object structure defining the network interface to be tagged for each server defined in manta_nodes, and for each of Manta's required networks. See below for an example of this section of the configuration.
Note: If aggregations are used, they must already exist in NAPI, and updating NIC tags on aggregations will require a reboot of the server in question.
The optional boolean distribute_svcs gives the operator control over the boot-time networking detection that happens each time manta-net.sh is executed (which determines whether the global zone SMF services should be distributed). For example, suppose an operator has enabled boot-time networking in a datacenter after installing Manta and subsequently would like to add some more storage nodes. For consistency, the operator can set distribute_svcs to true in order to force distribution of these global zone services.
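The full example below does not include this property; if used, it sits at the top level of the configuration file. A minimal sketch:
"distribute_svcs": true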
Note: For global zone network changes handled by boot-time networking to take effect, a reboot of the node must be performed. See Triton's virtual networking documentation for more information on boot-time networking.
For reference, here's an example multi-datacenter configuration with one service node (aac3c402-3047-11e3-b451-002590c57864) and one storage node (445aab6c-3048-11e3-9816-002590c3f3bc):
{
"this_az": "staging-1",
"manta_nodes": [
"aac3c402-3047-11e3-b451-002590c57864",
"445aab6c-3048-11e3-9816-002590c3f3bc"
],
"marlin_nodes": [
"445aab6c-3048-11e3-9816-002590c3f3bc"
],
"azs": [
"staging-1",
"staging-2",
"staging-3"
],
"admin": {
"nic_tag": "admin",
"network": "admin",
"staging-1": {
"subnet": "172.25.3.0/24",
"gateway": "172.25.3.1"
},
"staging-2": {
"subnet": "172.25.4.0/24",
"gateway": "172.25.4.1"
},
"staging-3": {
"subnet": "172.25.5.0/24",
"gateway": "172.25.5.1"
}
},
"manta": {
"nic_tag": "manta",
"network": "manta",
"staging-1": {
"vlan_id": 3603,
"subnet": "172.27.3.0/24",
"start": "172.27.3.4",
"end": "172.27.3.254",
"gateway": "172.27.3.1"
},
"staging-2": {
"vlan_id": 3604,
"subnet": "172.27.4.0/24",
"start": "172.27.4.4",
"end": "172.27.4.254",
"gateway": "172.27.4.1"
},
"staging-3": {
"vlan_id": 3605,
"subnet": "172.27.5.0/24",
"start": "172.27.5.4",
"end": "172.27.5.254",
"gateway": "172.27.5.1"
}
},
"marlin": {
"nic_tag": "mantanat",
"network": "mantanat",
"staging-1": {
"vlan_id": 3903,
"subnet": "172.28.64.0/19",
"start": "172.28.64.4",
"end": "172.28.95.254",
"gateway": "172.28.64.1"
},
"staging-2": {
"vlan_id": 3904,
"subnet": "172.28.96.0/19",
"start": "172.28.96.4",
"end": "172.28.127.254",
"gateway": "172.28.96.1"
},
"staging-3": {
"vlan_id": 3905,
"subnet": "172.28.128.0/19",
"start": "172.28.128.4",
"end": "172.28.159.254",
"gateway": "172.28.128.1"
}
},
"nic_mappings": {
"aac3c402-3047-11e3-b451-002590c57864": {
"manta": {
"mac": "90:e2:ba:4b:ec:d1"
}
},
"445aab6c-3048-11e3-9816-002590c3f3bc": {
"manta": {
"mac": "90:e2:ba:4a:32:71"
},
"mantanat": {
"aggr": "aggr0"
}
}
  }
}
The deprecated mac_mappings can also be used in place of nic_mappings. Only one of nic_mappings or mac_mappings is supported per network configuration file.
In a multi-datacenter configuration, this would be used to configure the "staging-1" datacenter. There would be two more configuration files, one for "staging-2" and one for "staging-3".
A common failure mode with manta-init ...
for those without a fast internet
link is a failure to import the large "manta-marlin" image. This is a multi-GB
image used for the zones in which Manta compute jobs run. The problem is that
the large image can make it easy to hit the one hour timeout for the
IMGAPI AdminImportRemoteImage
endpoint used to import Manta images. Neither this endpoint nor the
https://updates.tritondatacenter.com server hosting the images supports resumable
downloads.
Here is a manual workaround (run the following from the headnode global zone):
cd /var/tmp
# Determine the UUID of the latest "manta-marlin" image on updates.tritondatacenter.com.
muuid=$(updates-imgadm list name=manta-marlin --latest -H -o uuid)
# Download directly from a separate manual download area in Manta.
curl -kO https://us-central.manta.mnx.io/Joyent_Dev/public/Manta/manta-marlin-image/$muuid.imgmanifest
# First ensure that the origin (i.e. parent) image is installed
origin=$(json -f $muuid.imgmanifest origin)
[[ -z "$origin" ]] \
|| sdc-imgadm get $origin >/dev/null \
|| sdc-imgadm import $origin -S https://updates.tritondatacenter.com
# If that failed, then the separate download area doesn't have a recent
# image. Please log an issue.
[[ $? -ne 0 ]] && echo log an issue at https://github.com/TritonDataCenter/manta/issues/
# If the following is interrupted, then re-run the same command to resume:
curl -kO -C - https://us-central.manta.mnx.io/Joyent_Dev/public/Manta/manta-marlin-image/$muuid.file.gz
# Verify the download checksum
[[ $(json -f $muuid.imgmanifest | json files.0.sha1) \
== $(openssl dgst -sha1 $muuid.file.gz | awk '{print $2}') ]] \
|| echo "error downloading, please delete and retry"
# Then install this image into the DC's IMGAPI:
sdc-imgadm import -m $muuid.imgmanifest -f $muuid.file.gz
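As an optional follow-up (a sketch, not part of the documented workaround), you can confirm that the image is now present in the DC's IMGAPI and check its state:
sdc-imgadm get $muuid | json name version state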
"manta-adm" is the tool we use both to deploy all of the Manta zones and then to provision new zones, deprovision old zones, or reprovision old zones with a new image. "manta-adm" also has commands for viewing what's deployed, showing information about compute nodes, and more, but this section only discusses the configuration file format.
A manta-adm configuration file takes the form:
{
"COMPUTE_NODE_UUID": {
"SERVICE_NAME": {
"IMAGE_UUID": COUNT_OF_ZONES
},
"SHARDED_SERVICE_NAME": {
"SHARD_NUMBER": {
"IMAGE_UUID": COUNT_OF_ZONES
},
}
},
}
The file specifies how many of each kind of zone should be deployed on each compute node. For most zones, the "kind" of zone is just the service name (e.g., "storage"). For sharded zones, you also have to specify the shard number.
After you've run manta-init, you can generate a sample configuration for a single-system install using "manta-adm genconfig". Use that to give you an idea of what this looks like:
$ manta-adm genconfig coal
{
"<any>": {
"nameservice": {
"197e905a-d15d-11e3-90e2-6bf8f0ea92b3": 1
},
"postgres": {
"1": {
"92782f28-d236-11e3-9e6c-5f7613a4df37": 2
}
},
"moray": {
"1": {
"ef659002-d15c-11e3-a5f6-4bf577839d16": 1
}
},
"electric-moray": {
"e1043ddc-ca82-11e3-950a-ff14d493eebf": 1
},
"storage": {
"2306b44a-d15d-11e3-8f87-6b9768efe5ae": 2
},
"authcache": {
"5dff63a4-d15c-11e3-a312-5f3ea4981729": 1
},
"webapi": {
"319afbfa-d15e-11e3-9aa9-33ebf012af8f": 1
},
"loadbalancer": {
"7aac4c88-d15c-11e3-9ea6-dff0b07f5db1": 1
},
"jobsupervisor": {
"7cf43bb2-d16c-11e3-b157-cb0adb200998": 1
},
"jobpuller": {
"1b0f00e4-ca9b-11e3-ba7f-8723c9cd3ce7": 1
},
"medusa": {
"bb6e5424-d0bb-11e3-8527-676eccc2a50a": 1
},
"ops": {
"ab253aae-d15d-11e3-8f58-3fb986ce12b3": 1
},
"marlin": {
"1c18ae6e-cf70-473a-a22c-f3536d6ea789": 2
}
}
}
This file effectively specifies all of the Manta components except for the platforms and Marlin agents.
You can generate a configuration file that describes your current deployment with manta-adm show -s -j.
For a coal or lab deployment, your best bet is to save the output of manta-adm genconfig coal or manta-adm genconfig lab to a file and use that. This is what the manta-deploy-coal and manta-deploy-lab scripts do, and you may as well just use those.
Once you have a file like this, you can pass it to manta-adm update, which will show you what it will do in order to make the deployment match the configuration file, and then it will go ahead and do it. For more information, see "manta-adm help update".
There are two distinct methods of updating instances: you may deploy additional instances, or you may reprovision existing instances.
With the first method (new instances), additional instances are provisioned using a newer image. This approach allows you to add additional capacity without disrupting extant instances, and may prove useful when an operator needs to validate a new version of a service before adding it to the fleet.
With the second method (reprovision), this update will swap one image out for a newer image, while preserving any data in the instance's delegated dataset. Any data or customizations in the instance's main dataset (i.e., zones/UUID) will be lost. Services which have persistent state (manatee, mako, redis) must use this method to avoid discarding their data. This update takes the service offline for 15-30 seconds. If the image onto which an instance is reprovisioned doesn't work, the instance can be reprovisioned back to its original image.
This procedure uses "manta-adm" to do the upgrade, which uses the reprovisioning method for all zones other than the "marlin" zones.
- Figure out which image you want to install. You can list available images by running updates-imgadm:

  headnode$ updates-imgadm list name=manta-jobsupervisor | tail

  Replace manta-jobsupervisor with some other image name, or leave it off to see all images. Typically you'll want the most recent one. Note the uuid of the image in the first column.
- Figure out which zones you want to reprovision. In the headnode GZ of a given datacenter, you can enumerate the zones and versions for a given manta_role using:

  headnode$ manta-adm show jobsupervisor

  You'll want to note the VM UUIDs for the instances you want to update.
Run this in each datacenter:
- Download updated images. The supported approach is to re-run the manta-init command that you used when initially deploying Manta inside the manta-deployment zone. For us-central, use:

  $ manta-init -e manta+us-central@tritondatacenter.com -s production -c 10

  Do not run manta-init concurrently in multiple datacenters.
- Inside the Manta deployment zone, generate a configuration file representing the current deployment state:

  $ manta-adm show -s -j > config.json
- Modify the file as desired. See "manta-adm configuration" above for details on the format. In most cases, you just need to change the image uuid for the service that you're updating (see the sketch after this procedure for the typical change). You can get the latest image for a service with the following command:

  $ sdc-sapi "/services?name=[service name]&include_master=true" | \
      json -Ha params.image_uuid
- Pass the updated file to manta-adm update:

  $ manta-adm update config.json

  Do not run manta-adm update concurrently in multiple datacenters.
- Update the alarm configuration as needed. See "Amon Alarm Updates" below for details.
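For reference, here is a hedged sketch of the typical config.json change when updating a single service (in this case jobsupervisor; the old uuid is taken from the genconfig example above, and NEW_IMAGE_UUID is a placeholder for the image you imported):

     "jobsupervisor": {
-        "7cf43bb2-d16c-11e3-b157-cb0adb200998": 1
+        "NEW_IMAGE_UUID": 1
     },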
Run this procedure for each datacenter whose Marlin agents you want to upgrade.
- Find the build you want to use using updates-imgadm in the global-zone of the headnode:

  headnode$ updates-imgadm list name=marlin
- Fetch the desired tarball to /var/tmp on the headnode. The file will be named <UUID>-file.gz.

  headnode$ uuid=<UUID>
  headnode$ cd /var/tmp
  headnode$ updates-imgadm get-file -O "$uuid"
- Copy the tarball to each of the storage nodes:

  headnode$ manta-oneach -G -s storage \
      -d /var/tmp -g "/var/tmp/${uuid}-file.gz"
- Apply the update to all shrimps (storage nodes) with:

  headnode$ manta-oneach -G -s storage \
      "/opt/smartdc/agents/bin/apm install /var/tmp/${uuid}-file.gz"
- Verify that agents are online with:

  headnode$ manta-oneach -G -s storage 'svcs marlin-agent'
- Make sure most of the compute zones become ready. For this, you can use the dashboard or run this periodically:

  headnode$ manta-oneach -G -s storage mrzones
Note: The "apm install" operation will restart the marlin-agent service, which has the impact of aborting any currently-running tasks on that system and causing them to be retried.