Improve vstart_runner to (optionally) create its own cluster #12800

Merged
merged 5 commits into from Jan 31, 2017

Conversation

jcsp
Contributor

@jcsp jcsp commented Jan 5, 2017

This is possible now that the qa-suite code is in the main ceph repo

John Spray added 4 commits January 5, 2017 13:43
Instead of hunting around the filesystem for
ceph-qa-suite, get it from our own location.

Signed-off-by: John Spray <john.spray@redhat.com>
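
A minimal sketch of the idea in this commit message, assuming the runner lives at `<tree>/qa/tasks/vstart_runner.py` as the later comments in this thread suggest. The helper names `qa_root_for` and `ensure_on_sys_path` are hypothetical and are not taken from the PR:

```python
import os
import sys


def qa_root_for(script_path):
    """Return the directory two levels above script_path.

    For a script at <tree>/qa/tasks/vstart_runner.py this is <tree>/qa.
    """
    tasks_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.dirname(tasks_dir)


def ensure_on_sys_path(directory):
    """Prepend directory to sys.path unless it is already present."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory
```

Calling `ensure_on_sys_path(qa_root_for(__file__))` at the top of the script would make `tasks.*` importable without searching the filesystem or setting PYTHONPATH by hand.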
Previously this could get hung up if we killed one
PID and then the daemon reappears with a different
one (perhaps because we caught it during
daemonization?)

Signed-off-by: John Spray <john.spray@redhat.com>
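
One way to avoid the hang described above is to re-query for matching processes after every kill rather than waiting on a single PID. The following is a sketch under that assumption, not the code from this commit; `find_pids` and `kill` are hypothetical callables supplied by the caller:

```python
import time


def kill_until_gone(find_pids, kill, timeout=10.0, interval=0.1,
                    sleep=time.sleep, clock=time.monotonic):
    """Kill every PID returned by find_pids() until it returns none.

    Re-querying on each round catches a daemon that reappears under a
    new PID. Returns the set of PIDs that were signalled.
    """
    deadline = clock() + timeout
    signalled = set()
    while True:
        pids = list(find_pids())
        if not pids:
            return signalled
        if clock() > deadline:
            raise RuntimeError("processes still alive: {0}".format(pids))
        for pid in pids:
            kill(pid)
            signalled.add(pid)
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise in a unit test without real processes.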
Useful for vstart_runner.py to create a cluster
with no filesystem (CephFSTestCase does the
filesystem creation)

Signed-off-by: John Spray <john.spray@redhat.com>
Convenient when you want to create a fresh cluster
each test run: just pass --create and you'll get
a cluster with the right number of daemons for
the tests you're running.

Signed-off-by: John Spray <john.spray@redhat.com>
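
As an illustration of "the right number of daemons", the sketch below takes the maximum requirement across the selected test classes. The attribute names `MDSS_REQUIRED` and `CLIENTS_REQUIRED` and the defaults of 1 are assumptions made for this example, not something this thread confirms:

```python
def required_daemons(test_classes, default_mdss=1, default_clients=1):
    """Return the largest MDS and client counts any test class asks for.

    MDSS_REQUIRED and CLIENTS_REQUIRED are assumed attribute names.
    """
    counts = {"mds": default_mdss, "clients": default_clients}
    for cls in test_classes:
        counts["mds"] = max(
            counts["mds"], getattr(cls, "MDSS_REQUIRED", default_mdss))
        counts["clients"] = max(
            counts["clients"], getattr(cls, "CLIENTS_REQUIRED", default_clients))
    return counts
```

A cluster sized this way is large enough for every selected test without starting daemons that no test uses.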
@jcsp jcsp added cephfs Ceph File System tests labels Jan 5, 2017
@jcsp jcsp requested a review from batrick January 5, 2017 13:45
Member

@batrick batrick left a comment

Some minor comments. I want to give this a try but won't be able to until I get back to the USA. Too many troubles with the VPN.

@@ -3,16 +3,16 @@
ceph instance instead of a packaged/installed cluster. Use this to turn around test cases
quickly during development.

-Usage (assuming teuthology, ceph, ceph-qa-suite checked out in ~/git):
+Simple usage (assuming teuthology, ceph, ceph-qa-suite checked out in ~/git):
Member

We can remove "ceph-qa-suite" above.

-# Invoke a test using this script, with PYTHONPATH set appropriately
-python ~/git/ceph-qa-suite/tasks/vstart_runner.py
+# Invoke a test using this script
+python ~/git/ceph-qa-suite/tasks/vstart_runner.py --create tasks.cephfs.test_data_scan
Member

This command needs to be updated to ~/git/ceph/qa/....

# Invoke a test using this script
python ~/git/ceph-qa-suite/tasks/vstart_runner.py --create tasks.cephfs.test_data_scan

Alternative usage:

# Alternatively, if you use different paths, specify them as follows:
LD_LIBRARY_PATH=`pwd`/lib PYTHONPATH=~/git/teuthology:~/git/ceph-qa-suite:`pwd`/../src/pybind:`pwd`/lib/cython_modules/lib.2 python ~/git/ceph-qa-suite/tasks/vstart_runner.py
Member

These paths should be updated too.

@batrick batrick self-assigned this Jan 12, 2017
@batrick
Member

batrick commented Jan 18, 2017

I tried running the example test:

python ../qa/tasks/vstart_runner.py --interactive --create tasks.cephfs.test_data_scan

I had to fix a few permission denied errors by adding some sudos. Here's a patch:

diff --git a/qa/tasks/cephfs/fuse_mount.py b/qa/tasks/cephfs/fuse_mount.py
index 896ca5c67f..20934f2b27 100644
--- a/qa/tasks/cephfs/fuse_mount.py
+++ b/qa/tasks/cephfs/fuse_mount.py
@@ -49,7 +49,7 @@ class FuseMount(CephFSMount):
             daemon_signal,
         ]
 
-        fuse_cmd = ['ceph-fuse', "-f"]
+        fuse_cmd = ['sudo', 'ceph-fuse', "-f"]
 
         if mount_path is not None:
             fuse_cmd += ["--client_mountpoint={0}".format(mount_path)]
diff --git a/qa/tasks/vstart_runner.py b/qa/tasks/vstart_runner.py
index 2e8f2de2ab..838a52222e 100644
--- a/qa/tasks/vstart_runner.py
+++ b/qa/tasks/vstart_runner.py
@@ -192,7 +192,10 @@ class LocalRemoteProcess(object):
         if self.subproc.pid and not self.finished:
             log.info("kill: killing pid {0} ({1})".format(
                 self.subproc.pid, self.args))
-            safe_kill(self.subproc.pid)
+            if self.args[0] == 'sudo':
+                subprocess.call(['sudo', 'kill', str(self.subproc.pid)])
+            else:
+                safe_kill(self.subproc.pid)
         else:
             log.info("kill: already terminated ({0})".format(self.args))
 
@@ -234,9 +237,6 @@ class LocalRemote(object):
             logger=None, label=None, env=None):
         log.info("run args={0}".format(args))
 
-        # We don't need no stinkin' sudo
-        args = [a for a in args if a != "sudo"]
-
         # We have to use shell=True if any run.Raw was present, e.g. &&
         shell = any([a for a in args if isinstance(a, Raw)])
 
@@ -438,7 +438,7 @@ class LocalFuseMount(FuseMount):
 
         def list_connections():
             self.client_remote.run(
-                args=["mount", "-t", "fusectl", "/sys/fs/fuse/connections", "/sys/fs/fuse/connections"],
+                args=["sudo", "mount", "-t", "fusectl", "/sys/fs/fuse/connections", "/sys/fs/fuse/connections"],
                 check_status=False
             )
             p = self.client_remote.run(
@@ -460,7 +460,7 @@ class LocalFuseMount(FuseMount):
         pre_mount_conns = list_connections()
         log.info("Pre-mount connections: {0}".format(pre_mount_conns))
 
-        prefix = [os.path.join(BIN_PREFIX, "ceph-fuse")]
+        prefix = ["sudo", os.path.join(BIN_PREFIX, "ceph-fuse")]
         if os.getuid() != 0:
             prefix += ["--client-die-on-failed-remount=false"]
 

After fixing that, I ran across a strange error or newly uncovered bug. The file datafile is created as a normal user (i.e. my uid, not root) with permissions 660. After damage and recovery, the file has a different owner and different permissions. Here's the failure in the test run:

2017-01-17 22:48:08,950.950 INFO:__main__:run args=['sudo', 'chmod', '1777', '/tmp/tmpR0Vgeg/mnt.0']
2017-01-17 22:48:08,950.950 INFO:__main__:Running ['sudo', 'chmod', '1777', '/tmp/tmpR0Vgeg/mnt.0']
2017-01-17 22:48:08,972.972 INFO:__main__:run args=['getfattr', '--only-values', '-n', 'ceph.file.layout.object_size', './datafile']
2017-01-17 22:48:08,972.972 INFO:__main__:Running ['getfattr', '--only-values', '-n', 'ceph.file.layout.object_size', './datafile']
./datafile: ceph.file.layout.object_size: Permission denied
2017-01-17 22:48:08,987.987 INFO:__main__:test_rebuild_nondefault_layout (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2017-01-17 22:48:08,987.987 ERROR:__main__:Traceback (most recent call last):
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 413, in test_rebuild_nondefault_layout
    self._rebuild_metadata(NonDefaultLayout(self.fs, self.mount_a))
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 387, in _rebuild_metadata
    errors = workload.validate()
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 298, in validate
    p = self._mount.run_shell(["getfattr", "--only-values", "-n", "ceph.file.layout.object_size", "./datafile"])
  File "../qa/tasks/vstart_runner.py", line 412, in run_shell
    args, wait=wait, cwd=self.mountpoint
  File "../qa/tasks/vstart_runner.py", line 299, in run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 174, in wait
    raise CommandFailedError(self.args, self.exitstatus)
CommandFailedError: Command failed with status 1: ['getfattr', '--only-values', '-n', 'ceph.file.layout.object_size', './datafile']

2017-01-17 22:48:08,987.987 ERROR:__main__:Error in test 'test_rebuild_nondefault_layout (tasks.cephfs.test_data_scan.TestDataScan)', going interactive

To debug, I added a stat after the creation of datafile to check its permissions/owner:

  File: datafile
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 2eh/46d Inode: 1099511627776  Links: 1
Access: (0640/-rw-r-----)  Uid: ( 1000/pdonnell)   Gid: (  100/   users)
Access: 2017-01-17 22:47:38.758004657 -0500
Modify: 2017-01-17 22:47:38.758004657 -0500
Change: 2017-01-17 22:47:38.761191174 -0500
 Birth: -

Here's the stat after the getfattr failure:

$ stat datafile
  File: datafile
  Size: 33554432        Blocks: 65536      IO Block: 4194304 regular file
Device: 2eh/46d Inode: 1099511627776  Links: 1
Access: (0500/-r-x------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-01-17 22:47:45.000000000 -0500
Modify: 2017-01-17 22:47:45.000000000 -0500
Change: 2017-01-17 22:47:45.000000000 -0500
 Birth: -

So it went to owner root and permissions 0500. Is this a bug, or is something else going on?

@batrick
Member

batrick commented Jan 18, 2017

To be clear, I don't think the patch necessarily needs to be part of this PR, and the bug is obviously a separate problem. What do you think, @jcsp?

@jcsp
Contributor Author

jcsp commented Jan 18, 2017

@batrick when cephfs-data-scan injects a file from just its data objects, it invents some new metadata -- the uid/gid gets set to root/root.

Last I ran these tests they definitely didn't require root; I wonder if you've changed machines and need to update configuration to enable mounting ceph fuse as non-root?
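
To illustrate why root/root ownership leads to the "Permission denied" seen above, here is a sketch of the classic owner/group/other read check. It deliberately ignores ACLs, capabilities and the root override, so it is an approximation rather than the kernel's or ceph-fuse's actual logic; `can_read` is a hypothetical helper:

```python
import stat


def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Classic owner/group/other read check (no ACLs, no root override)."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# Values taken from the two stat outputs above (uid 1000, gid 100).
before = can_read(0o640, 1000, 100, uid=1000, gids={100})
after = can_read(0o500, 0, 0, uid=1000, gids={100})
```

With the original metadata the check passes for uid 1000; with the regenerated 0500 root/root metadata it fails, which is consistent with the getfattr error.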

@batrick
Member

batrick commented Jan 18, 2017

> @batrick when cephfs-data-scan injects a file from just its data objects, it invents some new metadata -- the uid/gid gets set to root/root.

Isn't that damage supposed to be repaired before getfattr gets executed?

> Last I ran these tests they definitely didn't require root; I wonder if you've changed machines and need to update configuration to enable mounting ceph fuse as non-root?

I'm using my laptop as the rex machines' disks are full. I am able to do fuse mounts as a normal user. I was trying to resolve problems I saw here after a failure (undoing my patch and retrying):

2017-01-18 09:14:05,942.942 INFO:__main__:run args=['mkdir', '--', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:05,942.942 INFO:__main__:Running ['mkdir', '--', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:05,952.952 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-18 09:14:05,952.952 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-18 09:14:05,968.968 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:05,968.968 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:05,974.974 INFO:__main__:Pre-mount connections: [45]
2017-01-18 09:14:05,974.974 INFO:__main__:run args=['./bin/ceph-fuse', '--client-die-on-failed-remount=false', '-f', '--name', 'client.0', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:05,974.974 INFO:__main__:Running ['./bin/ceph-fuse', '--client-die-on-failed-remount=false', '-f', '--name', 'client.0', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:05,979.979 INFO:__main__:Mounting client.0 with pid 21739
2017-01-18 09:14:05,980.980 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-18 09:14:05,980.980 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-18 09:14:05,987.987 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:05,987.987 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:06,995.995 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-18 09:14:06,996.996 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-18 09:14:07,015.015 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:07,016.016 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-18 09:14:07,034.034 INFO:__main__:Post-mount connections: [45, 48]
2017-01-18 09:14:07,035.035 INFO:__main__:run args=['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:07,035.035 INFO:__main__:Running ['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:07,055.055 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /tmp/tmpNNmLuZ/mnt.0
2017-01-18 09:14:07,056.056 INFO:__main__:run args=['sudo', 'chmod', '1777', '/tmp/tmpNNmLuZ/mnt.0']
2017-01-18 09:14:07,057.057 INFO:__main__:Running ['chmod', '1777', '/tmp/tmpNNmLuZ/mnt.0']
chmod: changing permissions of '/tmp/tmpNNmLuZ/mnt.0': Operation not permitted
2017-01-18 09:14:07,076.076 INFO:__main__:test_rebuild_backtraceless (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2017-01-18 09:14:07,076.076 ERROR:__main__:Traceback (most recent call last):
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 403, in test_rebuild_backtraceless
    self._rebuild_metadata(BacktracelessFile(self.fs, self.mount_a))
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 383, in _rebuild_metadata
    self.mount_a.wait_until_mounted()
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/fuse_mount.py", line 194, in wait_until_mounted
    args=['sudo', 'chmod', '1777', self.mountpoint])
  File "../qa/tasks/vstart_runner.py", line 296, in run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 174, in wait
    raise CommandFailedError(self.args, self.exitstatus)
CommandFailedError: Command failed with status 1: ['chmod', '1777', '/tmp/tmpNNmLuZ/mnt.0']

2017-01-18 09:14:07,077.077 ERROR:__main__:Error in test 'test_rebuild_backtraceless (tasks.cephfs.test_data_scan.TestDataScan)', going interactive
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...

I had noticed that mount -t fusectl ... failed, so I tried adding sudo, which worked. Adding sudo to the ceph-fuse call also resolved the above chmod failure.
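
The "Pre-mount connections" and "Post-mount connections" lines in the log suggest the runner identifies its new FUSE connection by comparing directory listings taken before and after the mount. A minimal sketch of that comparison, with hypothetical helper names:

```python
def parse_connections(ls_output):
    """Turn the output of `ls /sys/fs/fuse/connections` into sorted ints."""
    return sorted(int(name) for name in ls_output.split() if name.isdigit())


def new_fuse_connections(pre_mount, post_mount):
    """Return connection ids present after the mount but not before it."""
    return sorted(set(post_mount) - set(pre_mount))
```

For the log above, `new_fuse_connections([45], [45, 48])` yields `[48]`, the connection that belongs to the newly mounted client.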

Looking back in the logs, I remember I also saw this:

2017-01-18 09:13:24,966.966 INFO:__main__:kill
2017-01-18 09:13:24,967.967 INFO:__main__:kill: killing pid 18987 (['./bin/ceph-fuse', '--client-die-on-failed-remount=false', '-f', '--name', 'client.0', '/tmp/tmpNNmLuZ/mnt.0'])
2017-01-18 09:13:18.603116 7f1fb3c3ff40 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-18 09:13:18.603263 7f1fb3c3ff40 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-18 09:13:18.605003 7f1fb3c3ff40 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-18 09:13:18.605606 7f1fb3c3ff40 -1 init, newargv = 0x55f1443ed260 newargc=11
ceph-fuse[18987]: starting ceph client
ceph-fuse[18987]: starting fuse
mount: only root can use "--options" option
2017-01-18 09:13:18.620514 7f1fa60b3700 -1 client.4194 Failed to invoke remount, needed to ensure kernel dcache consistency
ceph-fuse[18987]: fuse finished with error 0 and tester_r 0

which made me think ceph-fuse had failed to start, but now I see it was just normal log output after the process exited.

So I think I don't actually need root for ceph-fuse (except for the dcache consistency) but I still have this chmod problem. Have you seen this before?
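
The log shows `run args=['sudo', 'chmod', ...]` followed by `Running ['chmod', ...]`, which matches the sudo-stripping list comprehension visible in the diff earlier in this thread. A standalone version of that filter (the function name `strip_sudo` is hypothetical):

```python
def strip_sudo(args):
    """Drop every literal "sudo" element so the command runs unprivileged."""
    return [a for a in args if a != "sudo"]


requested = ["sudo", "chmod", "1777", "/tmp/tmpNNmLuZ/mnt.0"]
executed = strip_sudo(requested)
```

Here `executed` is `['chmod', '1777', '/tmp/tmpNNmLuZ/mnt.0']`, so any command that genuinely needs root is run as the current user and can fail with "Operation not permitted".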

...to use paths pointing to ceph tree, not
ceph-qa-suite tree.

Signed-off-by: John Spray <john.spray@redhat.com>
@jcsp
Contributor Author

jcsp commented Jan 19, 2017

Updated the docstring

@jcsp
Contributor Author

jcsp commented Jan 19, 2017

@batrick it looks like LocalFuseMount (vstart_runner's special version) somehow isn't getting used on your system -- avoiding the chmod is one of the reasons it exists. I'm pretty sure this was working on my desktop, but I can check when I get home.

@jcsp
Contributor Author

jcsp commented Jan 26, 2017

@batrick yeah, this definitely works for me as an unprivileged user -- I suspect the breakage either has nothing to do with this PR, or you've got something funny in your Python paths.

@batrick
Member

batrick commented Jan 26, 2017

I will give it another try. Maybe something is misconfigured, yeah.

@batrick
Member

batrick commented Jan 30, 2017

@jcsp, tried again and got the same error. I don't have anything in my PYTHONPATH. I'm just using a virtualenv with teuthology and then starting vstart_runner.py:

$ python ../qa/tasks/vstart_runner.py --interactive --create tasks.cephfs.test_data_scan
[...]
2017-01-30 15:52:24,996.996 INFO:__main__:run args=['mkdir', '--', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:24,996.996 INFO:__main__:Running ['mkdir', '--', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:25,003.003 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-30 15:52:25,004.004 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-30 15:52:25,013.013 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:25,013.013 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:25,024.024 INFO:__main__:Pre-mount connections: [45]
2017-01-30 15:52:25,024.024 INFO:__main__:run args=['./bin/ceph-fuse', '--client-die-on-failed-remount=false', '-f', '--name', 'client.0', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:25,025.025 INFO:__main__:Running ['./bin/ceph-fuse', '--client-die-on-failed-remount=false', '-f', '--name', 'client.0', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:25,034.034 INFO:__main__:Mounting client.0 with pid 2732
2017-01-30 15:52:25,035.035 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-30 15:52:25,035.035 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-30 15:52:25,050.050 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:25,051.051 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:26,068.068 INFO:__main__:run args=['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
2017-01-30 15:52:26,069.069 INFO:__main__:Running ['mount', '-t', 'fusectl', '/sys/fs/fuse/connections', '/sys/fs/fuse/connections']
mount: only root can use "--types" option
2017-01-30 15:52:26,093.093 INFO:__main__:run args=['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:26,094.094 INFO:__main__:Running ['ls', '/sys/fs/fuse/connections']
2017-01-30 15:52:26,104.104 INFO:__main__:Post-mount connections: [39, 45]
2017-01-30 15:52:26,104.104 INFO:__main__:run args=['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:26,104.104 INFO:__main__:Running ['stat', '--file-system', '--printf=%T\n', '--', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:26,115.115 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /tmp/tmpCteqsJ/mnt.0
2017-01-30 15:52:26,116.116 INFO:__main__:run args=['sudo', 'chmod', '1777', '/tmp/tmpCteqsJ/mnt.0']
2017-01-30 15:52:26,116.116 INFO:__main__:Running ['chmod', '1777', '/tmp/tmpCteqsJ/mnt.0']
chmod: changing permissions of '/tmp/tmpCteqsJ/mnt.0': Operation not permitted
2017-01-30 15:52:26,128.128 INFO:__main__:test_rebuild_backtraceless (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2017-01-30 15:52:26,129.129 ERROR:__main__:Traceback (most recent call last):
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 403, in test_rebuild_backtraceless
    self._rebuild_metadata(BacktracelessFile(self.fs, self.mount_a))
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/test_data_scan.py", line 383, in _rebuild_metadata
    self.mount_a.wait_until_mounted()
  File "/home/pdonnell/scm/ceph/qa/tasks/cephfs/fuse_mount.py", line 194, in wait_until_mounted
    args=['sudo', 'chmod', '1777', self.mountpoint])
  File "../qa/tasks/vstart_runner.py", line 296, in run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 174, in wait
    raise CommandFailedError(self.args, self.exitstatus)
CommandFailedError: Command failed with status 1: ['chmod', '1777', '/tmp/tmpCteqsJ/mnt.0']

2017-01-30 15:52:26,129.129 ERROR:__main__:Error in test 'test_rebuild_backtraceless (tasks.cephfs.test_data_scan.TestDataScan)', going interactive
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...
>>> print(os.getenv("PYTHONPATH"))
None

According to the debug log, it appears to be using LocalFuseMount as expected.
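
Besides printing PYTHONPATH, the interactive prompt can also report which class an object actually is and where that class was loaded from. The helper below is a hypothetical debugging aid, not part of vstart_runner.py:

```python
import inspect


def describe_origin(obj):
    """Return "<module>.<class> from <source file>" for obj or its class."""
    cls = obj if inspect.isclass(obj) else type(obj)
    try:
        source = inspect.getsourcefile(cls)
    except (TypeError, OSError):
        # Built-in classes and classes defined interactively have no file.
        source = None
    return "{0}.{1} from {2}".format(cls.__module__, cls.__name__, source)
```

Running it on a mount object at the prompt would show whether the instance is the local subclass or the generic one, and which checkout the module came from.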

In any case, I think you can merge this if you want. This is most likely not related to this PR. Other tests like tasks.cephfs.test_readahead work just fine.

@jcsp jcsp merged commit d4f6385 into ceph:master Jan 31, 2017
@jcsp jcsp deleted the wip-vstart-qasuite branch January 31, 2017 01:02