zfs: support force exporting pools
This is primarily useful when a pool has lost its disk and the user
does not care about any pending (or other) transactions.

Implement various control methods to make this feasible:
- txg_wait can now take a NOSUSPEND flag, in which case the caller will
  be alerted if their txg can't be committed.  This is primarily of
  interest for callers that would normally pass TXG_WAIT, but don't want
  to wait if the pool becomes suspended, which allows unwinding in some
  cases, specifically when one is attempting a non-forced export.
  Without this, the non-forced export would preclude a forced export
  by virtue of holding the namespace lock indefinitely.
- txg_wait also returns failure for TXG_WAIT users if a pool is actually
  being force exported.  Adjust most callers to tolerate this.
- spa_config_enter_flags now takes a NOSUSPEND flag to the same effect.
- Add a DMU objset initiator, which may be set on an objset being
  forcibly exported / unmounted.
- Add an SPA export initiator, which may be set on a pool being forcibly
  exported.
- DMU send/recv now use an interruption mechanism which relies on the
  SPA export initiator being able to enumerate datasets and close any
  send/recv streams, causing their EINTR paths to be invoked.
- ZIO now has a cancel entry point, which tells all suspended zios to
  fail, and which suppresses the failures for non-CANFAIL users.
- Clean up metaslabs, etc. by simply throwing away any changes that
  could not be synced out.
- Linux specific: introduce a new tunable,
  zfs_forced_export_unmount_enabled, which allows the filesystem to
  remain in a modified 'unmounted' state upon exiting zpl_umount_begin,
  to achieve parity with FreeBSD and illumos, which have VFS-level
  support for yanking filesystems out from under users.  However, this
  only helps when the user is actively performing I/O, while not
  sitting on the filesystem.  In particular, this allows test #3 below
  to pass on Linux.
- Add basic logic to zpool to indicate a force-exporting pool, instead
  of crashing due to lack of config, etc.
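
As a rough illustration of the first two points, the following
user-space sketch models how a waiter might react to suspension and
forced export. Only the TXG_* flag values come from this change
(include/sys/dmu.h); model_pool_t, model_txg_wait(), MODEL_WOULD_BLOCK
and the chosen errno values are hypothetical and do not reflect the
kernel's txg_wait interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in include/sys/dmu.h by this change. */
#define	TXG_NOWAIT	(0ULL)
#define	TXG_WAIT	(1ULL<<0)
#define	TXG_NOTHROTTLE	(1ULL<<1)
#define	TXG_NOSUSPEND	(1ULL<<2)

/* Hypothetical sentinel: the caller would keep blocking. */
#define	MODEL_WOULD_BLOCK	(-1)

/* Hypothetical model of the pool state a waiter cares about. */
typedef struct model_pool {
	bool mp_suspended;		/* I/O suspended, e.g. disk is gone */
	bool mp_force_exporting;	/* a forced export is in progress */
} model_pool_t;

/*
 * Hypothetical stand-in for a blocking txg wait.  Returns 0 when the
 * txg can be committed, an errno value when the caller must unwind,
 * or MODEL_WOULD_BLOCK when the caller would wait indefinitely.
 * The errno values are assumptions made for this sketch.
 */
static int
model_txg_wait(const model_pool_t *mp, uint64_t flags)
{
	assert(flags & TXG_WAIT);	/* the sketch only models waiters */

	/* A forced export fails every waiter, even plain TXG_WAIT. */
	if (mp->mp_force_exporting)
		return (EINTR);
	if (!mp->mp_suspended)
		return (0);
	/* Suspended: NOSUSPEND callers are alerted instead of blocking. */
	if (flags & TXG_NOSUSPEND)
		return (EAGAIN);
	return (MODEL_WOULD_BLOCK);
}
```

In this model a non-forced export would pass TXG_WAIT | TXG_NOSUSPEND,
so that it unwinds instead of holding the namespace lock while the pool
is suspended.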

Add tests which cover the basic use cases:
- Force export while a send is in progress
- Force export while a recv is in progress
- Force export while POSIX I/O is in progress

This change modifies the libzfs ABI:
- New ZPOOL_STATUS_FORCE_EXPORTING zpool_status_t enum value.
- New field libzfs_force_export for libzfs_handle.

Signed-off-by: Will Andrews <will@firepipe.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#3461
wca authored and oshogbo committed Mar 17, 2023
1 parent fa46802 commit 8a73b28
Showing 84 changed files with 2,049 additions and 408 deletions.
9 changes: 6 additions & 3 deletions cmd/zpool/zpool_main.c
@@ -363,7 +363,7 @@ get_usage(zpool_help_t idx)
 	case HELP_DETACH:
 		return (gettext("\tdetach <pool> <device>\n"));
 	case HELP_EXPORT:
-		return (gettext("\texport [-af] <pool> ...\n"));
+		return (gettext("\texport [-afF] <pool> ...\n"));
 	case HELP_HISTORY:
 		return (gettext("\thistory [-il] [<pool>] ...\n"));
 	case HELP_IMPORT:
@@ -1907,7 +1907,7 @@ zpool_export_one(zpool_handle_t *zhp, void *data)
 {
 	export_cbdata_t *cb = data;

-	if (zpool_disable_datasets(zhp, cb->force) != 0)
+	if (zpool_disable_datasets(zhp, cb->force || cb->hardforce) != 0)
 		return (1);

 	/* The history must be logged as part of the export */
@@ -1928,10 +1928,13 @@ zpool_export_one(zpool_handle_t *zhp, void *data)
  *
  * -a Export all pools
  * -f Forcefully unmount datasets
+ * -F Forcefully export, dropping all outstanding dirty data
  *
  * Export the given pools. By default, the command will attempt to cleanly
  * unmount any active datasets within the pool. If the '-f' flag is specified,
- * then the datasets will be forcefully unmounted.
+ * then the datasets will be forcefully unmounted. If the '-F' flag is
+ * specified, the pool's dirty data, if any, will simply be dropped after a
+ * best-effort attempt to forcibly stop all activity.
  */
 int
 zpool_do_export(int argc, char **argv)
1 change: 1 addition & 0 deletions include/libzfs.h
@@ -419,6 +419,7 @@ typedef enum {
 	ZPOOL_STATUS_NON_NATIVE_ASHIFT, /* (e.g. 512e dev with ashift of 9) */
 	ZPOOL_STATUS_COMPATIBILITY_ERR, /* bad 'compatibility' property */
 	ZPOOL_STATUS_INCOMPATIBLE_FEAT, /* feature set outside compatibility */
+	ZPOOL_STATUS_FORCE_EXPORTING, /* pool is being force exported */

 	/*
 	 * Finally, the following indicates a healthy pool.
3 changes: 3 additions & 0 deletions include/os/freebsd/spl/sys/thread.h
@@ -31,4 +31,7 @@

 #define getcomm() curthread->td_name
 #define getpid() curthread->td_tid
+#define thread_signal spl_kthread_signal
+extern int spl_kthread_signal(kthread_t *tsk, int sig);
+
 #endif
2 changes: 2 additions & 0 deletions include/os/freebsd/zfs/sys/zfs_znode_impl.h
@@ -135,6 +135,8 @@ zfs_enter(zfsvfs_t *zfsvfs, const char *tag)
 	return (0);
 }

+#define zfs_enter_unmountok zfs_enter
+
 /* Must be called before exiting the vop */
 static inline void
 zfs_exit(zfsvfs_t *zfsvfs, const char *tag)
2 changes: 2 additions & 0 deletions include/os/linux/spl/sys/thread.h
@@ -53,6 +53,7 @@ typedef void (*thread_func_t)(void *);
 	__thread_create(stk, stksize, (thread_func_t)func, #func, \
 	arg, len, pp, state, pri)

+#define thread_signal(t, s) spl_kthread_signal(t, s)
 #define thread_exit() spl_thread_exit()
 #define thread_join(t) VERIFY(0)
 #define curthread current
@@ -64,6 +65,7 @@ extern kthread_t *__thread_create(caddr_t stk, size_t stksize,
     int state, pri_t pri);
 extern struct task_struct *spl_kthread_create(int (*func)(void *),
     void *data, const char namefmt[], ...);
+extern int spl_kthread_signal(kthread_t *tsk, int sig);

 static inline __attribute__((noreturn)) void
 spl_thread_exit(void)
3 changes: 2 additions & 1 deletion include/os/linux/zfs/sys/zfs_vfsops_os.h
@@ -101,7 +101,8 @@ struct zfsvfs {
 	boolean_t z_utf8; /* utf8-only */
 	int z_norm; /* normalization flags */
 	boolean_t z_relatime; /* enable relatime mount option */
-	boolean_t z_unmounted; /* unmounted */
+	boolean_t z_unmounted; /* mount status */
+	boolean_t z_force_unmounted; /* force-unmounted status */
 	rrmlock_t z_teardown_lock;
 	krwlock_t z_teardown_inactive_lock;
 	list_t z_all_znodes; /* all znodes in the fs */
27 changes: 21 additions & 6 deletions include/os/linux/zfs/sys/zfs_znode_impl.h
@@ -98,24 +98,39 @@ extern "C" {
 #define zhold(zp) VERIFY3P(igrab(ZTOI((zp))), !=, NULL)
 #define zrele(zp) iput(ZTOI((zp)))

+#define zfsvfs_is_unmounted(zfsvfs) \
+	((zfsvfs)->z_unmounted || (zfsvfs)->z_force_unmounted)
+
+/* Must be called before exiting the operation. */
+static inline void
+zfs_exit(zfsvfs_t *zfsvfs, const char *tag)
+{
+	zfs_exit_fs(zfsvfs);
+	ZFS_TEARDOWN_EXIT_READ(zfsvfs, tag);
+}
+
 /* Called on entry to each ZFS inode and vfs operation. */
 static inline int
 zfs_enter(zfsvfs_t *zfsvfs, const char *tag)
 {
 	ZFS_TEARDOWN_ENTER_READ(zfsvfs, tag);
-	if (unlikely(zfsvfs->z_unmounted)) {
+	if (unlikely(zfsvfs_is_unmounted(zfsvfs))) {
 		ZFS_TEARDOWN_EXIT_READ(zfsvfs, tag);
 		return (SET_ERROR(EIO));
 	}
 	return (0);
 }

-/* Must be called before exiting the operation. */
-static inline void
-zfs_exit(zfsvfs_t *zfsvfs, const char *tag)
+/* ZFS_ENTER but ok with forced unmount having begun */
+static inline int
+zfs_enter_unmountok(zfsvfs_t *zfsvfs, const char *tag)
 {
-	zfs_exit_fs(zfsvfs);
-	ZFS_TEARDOWN_EXIT_READ(zfsvfs, tag);
+	ZFS_TEARDOWN_ENTER_READ(zfsvfs, tag);
+	if (unlikely((zfsvfs)->z_unmounted == B_TRUE)) {
+		zfs_exit(zfsvfs, tag);
+		return (SET_ERROR(EIO));
+	}
+	return (0);
 }

 static inline int
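The difference between the two Linux entry points can be shown with a
reduced user-space model. The two flag names are taken from this diff;
model_zfsvfs_t and the model_* functions are hypothetical, and the
teardown lock taken by the real functions is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of zfsvfs_t to the two flags of interest. */
typedef struct model_zfsvfs {
	bool z_unmounted;	/* mount status */
	bool z_force_unmounted;	/* force-unmounted status */
} model_zfsvfs_t;

/* Mirrors zfsvfs_is_unmounted(): either flag counts as unmounted. */
static bool
model_is_unmounted(const model_zfsvfs_t *z)
{
	assert(z != NULL);
	return (z->z_unmounted || z->z_force_unmounted);
}

/* Like zfs_enter(): refuse entry once either flag is set. */
static int
model_enter(const model_zfsvfs_t *z)
{
	return (model_is_unmounted(z) ? EIO : 0);
}

/* Like zfs_enter_unmountok(): tolerate a forced unmount in progress. */
static int
model_enter_unmountok(const model_zfsvfs_t *z)
{
	assert(z != NULL);
	return (z->z_unmounted ? EIO : 0);
}
```

In this model, only operations that use the unmountok variant can still
enter once z_force_unmounted is set; ordinary operations get EIO.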
1 change: 1 addition & 0 deletions include/sys/arc.h
@@ -338,6 +338,7 @@ void l2arc_fini(void);
 void l2arc_start(void);
 void l2arc_stop(void);
 void l2arc_spa_rebuild_start(spa_t *spa);
+void l2arc_spa_rebuild_stop(spa_t *spa);

 #ifndef _KERNEL
 extern boolean_t arc_watch;
1 change: 1 addition & 0 deletions include/sys/dmu.h
@@ -283,6 +283,7 @@ typedef enum dmu_object_type {
 #define TXG_NOWAIT (0ULL)
 #define TXG_WAIT (1ULL<<0)
 #define TXG_NOTHROTTLE (1ULL<<1)
+#define TXG_NOSUSPEND (1ULL<<2)

 void byteswap_uint64_array(void *buf, size_t size);
 void byteswap_uint32_array(void *buf, size_t size);
1 change: 1 addition & 0 deletions include/sys/dmu_impl.h
@@ -241,6 +241,7 @@ typedef struct dmu_sendstatus {
 	list_node_t dss_link;
 	int dss_outfd;
 	proc_t *dss_proc;
+	kthread_t *dss_thread;
 	offset_t *dss_off;
 	uint64_t dss_blocks; /* blocks visited during the sending process */
 } dmu_sendstatus_t;
5 changes: 5 additions & 0 deletions include/sys/dmu_objset.h
@@ -172,6 +172,7 @@ struct objset {

 	/* Protected by os_lock */
 	kmutex_t os_lock;
+	kthread_t *os_shutdown_initiator;
 	multilist_t os_dirty_dnodes[TXG_SIZE];
 	list_t os_dnodes;
 	list_t os_downgraded_dbufs;
@@ -259,6 +260,10 @@ int dmu_fsname(const char *snapname, char *buf);
 void dmu_objset_evict_done(objset_t *os);
 void dmu_objset_willuse_space(objset_t *os, int64_t space, dmu_tx_t *tx);

+int dmu_objset_shutdown_register(objset_t *os);
+boolean_t dmu_objset_exiting(objset_t *os);
+void dmu_objset_shutdown_unregister(objset_t *os);
+
 void dmu_objset_init(void);
 void dmu_objset_fini(void);

4 changes: 4 additions & 0 deletions include/sys/dmu_recv.h
@@ -40,6 +40,7 @@ extern const char *const recv_clone_name;

 typedef struct dmu_recv_cookie {
 	struct dsl_dataset *drc_ds;
+	kthread_t *drc_initiator;
 	struct dmu_replay_record *drc_drr_begin;
 	struct drr_begin *drc_drrb;
 	const char *drc_tofs;
@@ -57,6 +58,8 @@ typedef struct dmu_recv_cookie {
 	nvlist_t *drc_keynvl;
 	uint64_t drc_fromsnapobj;
 	uint64_t drc_ivset_guid;
+	unsigned int drc_flags;
+	void *drc_rwa;
 	void *drc_owner;
 	cred_t *drc_cred;
 	proc_t *drc_proc;
@@ -83,6 +86,7 @@ int dmu_recv_begin(const char *, const char *, dmu_replay_record_t *,
     boolean_t, boolean_t, boolean_t, nvlist_t *, nvlist_t *, const char *,
     dmu_recv_cookie_t *, zfs_file_t *, offset_t *);
 int dmu_recv_stream(dmu_recv_cookie_t *, offset_t *);
+int dmu_recv_close(dsl_dataset_t *ds);
 int dmu_recv_end(dmu_recv_cookie_t *, void *);
 boolean_t dmu_objset_is_receiving(objset_t *);

1 change: 1 addition & 0 deletions include/sys/dmu_send.h
@@ -60,6 +60,7 @@ int dmu_send_obj(const char *pool, uint64_t tosnap, uint64_t fromsnap,
     boolean_t embedok, boolean_t large_block_ok, boolean_t compressok,
     boolean_t rawok, boolean_t savedok, int outfd, offset_t *off,
     struct dmu_send_outparams *dso);
+int dmu_send_close(struct dsl_dataset *ds);

 typedef int (*dmu_send_outfunc_t)(objset_t *os, void *buf, int len, void *arg);
 typedef struct dmu_send_outparams {
7 changes: 6 additions & 1 deletion include/sys/dsl_dataset.h
@@ -242,6 +242,8 @@ typedef struct dsl_dataset {
 	kmutex_t ds_sendstream_lock;
 	list_t ds_sendstreams;

+	struct dmu_recv_cookie *ds_receiver;
+
 	/*
 	 * When in the middle of a resumable receive, tracks how much
 	 * progress we have made.
@@ -324,7 +326,8 @@ typedef struct dsl_dataset_rename_snapshot_arg {
 /* flags for holding the dataset */
 typedef enum ds_hold_flags {
 	DS_HOLD_FLAG_NONE = 0 << 0,
-	DS_HOLD_FLAG_DECRYPT = 1 << 0 /* needs access to encrypted data */
+	DS_HOLD_FLAG_DECRYPT = 1 << 0, /* needs access to encrypted data */
+	DS_HOLD_FLAG_MUST_BE_OPEN = 1 << 1, /* dataset must already be open */
 } ds_hold_flags_t;

 int dsl_dataset_hold(struct dsl_pool *dp, const char *name, const void *tag,
@@ -453,6 +456,8 @@ void dsl_dataset_long_hold(dsl_dataset_t *ds, const void *tag);
 void dsl_dataset_long_rele(dsl_dataset_t *ds, const void *tag);
 boolean_t dsl_dataset_long_held(dsl_dataset_t *ds);

+int dsl_dataset_sendrecv_cancel_all(spa_t *spa);
+
 int dsl_dataset_clone_swap_check_impl(dsl_dataset_t *clone,
     dsl_dataset_t *origin_head, boolean_t force, void *owner, dmu_tx_t *tx);
 void dsl_dataset_clone_swap_sync_impl(dsl_dataset_t *clone,
2 changes: 1 addition & 1 deletion include/sys/dsl_scan.h
@@ -172,7 +172,7 @@ int dsl_scan(struct dsl_pool *, pool_scan_func_t);
 void dsl_scan_assess_vdev(struct dsl_pool *dp, vdev_t *vd);
 boolean_t dsl_scan_scrubbing(const struct dsl_pool *dp);
 int dsl_scrub_set_pause_resume(const struct dsl_pool *dp, pool_scrub_cmd_t cmd);
-void dsl_scan_restart_resilver(struct dsl_pool *, uint64_t txg);
+int dsl_scan_restart_resilver(struct dsl_pool *, uint64_t txg);
 boolean_t dsl_scan_resilvering(struct dsl_pool *dp);
 boolean_t dsl_scan_resilver_scheduled(struct dsl_pool *dp);
 boolean_t dsl_dataset_unstable(struct dsl_dataset *ds);
1 change: 1 addition & 0 deletions include/sys/metaslab.h
@@ -114,6 +114,7 @@ boolean_t metaslab_class_throttle_reserve(metaslab_class_t *, int, int,
     zio_t *, int);
 void metaslab_class_throttle_unreserve(metaslab_class_t *, int, int, zio_t *);
 void metaslab_class_evict_old(metaslab_class_t *, uint64_t);
+void metaslab_class_force_discard(metaslab_class_t *);
 uint64_t metaslab_class_get_alloc(metaslab_class_t *);
 uint64_t metaslab_class_get_space(metaslab_class_t *);
 uint64_t metaslab_class_get_dspace(metaslab_class_t *);
21 changes: 17 additions & 4 deletions include/sys/spa.h
@@ -836,16 +836,13 @@ extern kmutex_t spa_namespace_lock;
  * SPA configuration functions in spa_config.c
  */

-#define SPA_CONFIG_UPDATE_POOL 0
-#define SPA_CONFIG_UPDATE_VDEVS 1
-
 extern void spa_write_cachefile(spa_t *, boolean_t, boolean_t, boolean_t);
 extern void spa_config_load(void);
 extern nvlist_t *spa_all_configs(uint64_t *);
 extern void spa_config_set(spa_t *spa, nvlist_t *config);
 extern nvlist_t *spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg,
     int getstats);
-extern void spa_config_update(spa_t *spa, int what);
+extern int spa_config_update_pool(spa_t *spa);
 extern int spa_config_parse(spa_t *spa, vdev_t **vdp, nvlist_t *nv,
     vdev_t *parent, uint_t id, int atype);

@@ -962,6 +959,13 @@ extern void spa_iostats_trim_add(spa_t *spa, trim_type_t type,
     uint64_t extents_written, uint64_t bytes_written,
     uint64_t extents_skipped, uint64_t bytes_skipped,
     uint64_t extents_failed, uint64_t bytes_failed);
+
+/* Config lock handling flags */
+typedef enum {
+	SCL_FLAG_TRYENTER = 1U << 0,
+	SCL_FLAG_NOSUSPEND = 1U << 1,
+} spa_config_flag_t;
+
 extern void spa_import_progress_add(spa_t *spa);
 extern void spa_import_progress_remove(uint64_t spa_guid);
 extern int spa_import_progress_set_mmp_check(uint64_t pool_guid,
@@ -974,6 +978,8 @@ extern int spa_import_progress_set_state(uint64_t pool_guid,
 /* Pool configuration locks */
 extern int spa_config_tryenter(spa_t *spa, int locks, const void *tag,
     krw_t rw);
+extern int spa_config_enter_flags(spa_t *spa, int locks, const void *tag,
+    krw_t rw, spa_config_flag_t flags);
 extern void spa_config_enter(spa_t *spa, int locks, const void *tag, krw_t rw);
 extern void spa_config_exit(spa_t *spa, int locks, const void *tag);
 extern int spa_config_held(spa_t *spa, int locks, krw_t rw);
@@ -1022,6 +1028,7 @@ extern uint64_t spa_last_synced_txg(spa_t *spa);
 extern uint64_t spa_first_txg(spa_t *spa);
 extern uint64_t spa_syncing_txg(spa_t *spa);
 extern uint64_t spa_final_dirty_txg(spa_t *spa);
+extern void spa_verify_dirty_txg(spa_t *spa, uint64_t txg);
 extern uint64_t spa_version(spa_t *spa);
 extern pool_state_t spa_state(spa_t *spa);
 extern spa_load_state_t spa_load_state(spa_t *spa);
@@ -1041,6 +1048,8 @@ extern metaslab_class_t *spa_dedup_class(spa_t *spa);
 extern metaslab_class_t *spa_preferred_class(spa_t *spa, uint64_t size,
     dmu_object_type_t objtype, uint_t level, uint_t special_smallblk);

+extern void spa_evicting_os_lock(spa_t *);
+extern void spa_evicting_os_unlock(spa_t *);
 extern void spa_evicting_os_register(spa_t *, objset_t *os);
 extern void spa_evicting_os_deregister(spa_t *, objset_t *os);
 extern void spa_evicting_os_wait(spa_t *spa);
@@ -1132,6 +1141,10 @@ extern void spa_history_log_internal_dd(dsl_dir_t *dd, const char *operation,

 extern const char *spa_state_to_name(spa_t *spa);

+extern boolean_t spa_exiting_any(spa_t *spa);
+extern boolean_t spa_exiting(spa_t *spa);
+extern int spa_operation_interrupted(spa_t *spa);
+
 /* error handling */
 struct zbookmark_phys;
 extern void spa_log_error(spa_t *spa, const zbookmark_phys_t *zb);
1 change: 1 addition & 0 deletions include/sys/spa_impl.h
@@ -244,6 +244,7 @@ struct spa {
 	kmutex_t spa_evicting_os_lock; /* Evicting objset list lock */
 	list_t spa_evicting_os_list; /* Objsets being evicted. */
 	kcondvar_t spa_evicting_os_cv; /* Objset Eviction Completion */
+	kthread_t *spa_export_initiator; /* thread exporting the pool */
 	txg_list_t spa_vdev_txg_list; /* per-txg dirty vdev list */
 	vdev_t *spa_root_vdev; /* top-level vdev container */
 	uint64_t spa_min_ashift; /* of vdevs in normal class */
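One plausible reading of the export-initiator field is that a forced
export records the initiating thread, so that other threads can be
turned away while the initiator itself proceeds with teardown. The
sketch below models that idea with integer thread IDs. This is an
assumption: the excerpt shows only the declarations of
spa_exiting_any() and spa_exiting(), not their bodies, and model_spa_t,
MODEL_NO_THREAD, model_exiting_any() and model_exiting() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MODEL_NO_THREAD	0	/* hypothetical "no initiator" marker */

/* Hypothetical reduction of spa_t to its export initiator. */
typedef struct model_spa {
	int spa_export_initiator;	/* exporting thread ID, or 0 */
} model_spa_t;

/* True if any thread is force exporting the pool. */
static bool
model_exiting_any(const model_spa_t *spa)
{
	assert(spa != NULL);
	return (spa->spa_export_initiator != MODEL_NO_THREAD);
}

/*
 * True if a thread other than 'self' is force exporting the pool.
 * Exempting the initiator is an assumed semantic, not confirmed here.
 */
static bool
model_exiting(const model_spa_t *spa, int self)
{
	return (model_exiting_any(spa) &&
	    spa->spa_export_initiator != self);
}
```

Under this assumption, thread 7 exporting the pool would see
model_exiting() return false for itself and true for every other
thread.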