Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for AWS S3 #60

Merged
merged 17 commits into from Apr 2, 2013
Merged

Conversation

howech
Copy link
Contributor

@howech howech commented Mar 23, 2013

Drake is fantastic work! I hope that my small contribution can be used to make it better.

This pull request add the ability to interact with objects stored in s3.

s3 paths are of the form s3://bucket/path/to/object

aws credentials are loaded from a file identified on command line arguments (eg --aws-credentials=/path/to/credendials.properties). The file should be a properties file containing "access_key" and "access_secret" properties with appropriate values. Other properties in the file are ignored. If you use s3cmd, you can just use the same ~/.s3cfg file).

In order to get access to command line options and because of cyclic dependencies, I had to split the options declaration and support code out to a separate file.

I also wrote some tests in the test/drake/test/fs.clj, paralleling the local file system tests, but I left them commented out, as they depend on a lot of things that are by no means generic. I am new to clojure, so I have not figured out yet how to mock the s3 libraries in a way that would allow me to write more proper unit tests.

--Chris Howe
System Architect
Civitas Learning

@aboytsov
Copy link
Contributor

Hello, Chris!

I can't tell you how excited we are to have the first major pull request from the Drake community! This addition is awesome and has been requested on multiple occasions. I will review it as soon as I can - should be today.

Regards,
Artem.

@aboytsov aboytsov mentioned this pull request Mar 23, 2013
@howech
Copy link
Contributor Author

howech commented Mar 23, 2013

Thanks!!!

@@ -0,0 +1,9 @@
(ns drake.options
(:refer-clojure :exclude [file-seq])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem like you need anything other than (ns drake.options) here. gen-class probably not needed also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct.

@aboytsov
Copy link
Contributor

It would be nice if we added S3 to resources/regtest/regtest_fs.(d|sh) or as a new regression test. regtest_fs currently creates some files on HDFS and tests a simple workflow that includes both HDFS & local files and combines data from both. In case there's no HDFS access on the tester's machine, it replaces it with a local directory instead (see regtest_fs.sh, it just sets up a couple of environment variables to achieve this effect). You could add an S3 node in a similar way.

@howech
Copy link
Contributor Author

howech commented Mar 23, 2013

These all seem pretty reasonable comments. I'll try to comply sometime this
afternoon.
On Mar 23, 2013 12:33 PM, "Artem Boytsov" notifications@github.com wrote:

It would be nice if we added S3 to resources/regtest/regtest_fs.(d|sh) or
add it to some other regression test. regtest_fs currently creates some
files on HDFS and tests a simple workflow that includes both HDFS & local
files and combines data from it. In case there's no HDFS access on the
tester's machine, it replaces it with a local directory instead (see
regtest_fs.sh, it just sets up a couple of environment variables to
achieve this effect). You could add an S3 node in a similar way.


Reply to this email directly or view it on GitHubhttps://github.com//pull/60#issuecomment-15341017
.

@aboytsov
Copy link
Contributor

No rush. I haven't finished it though - got pulled into something, I'll let you know when I finish if you want to take care of it all at the same time.

@@ -134,7 +136,7 @@

(defn- hdfs-file-info [status]
{:path (remove-hdfs-prefix (.toString (.getPath status)))
:mod-time (.getModificationTime status)
:mod-time (:last-modified status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it it's not an accidental change as it pertains to HDFS. Did you run HDFS-related regtests (regtest_fs.sh)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this was completely accidental. My bad.

On Sat, Mar 23, 2013 at 3:05 PM, Artem Boytsov notifications@github.comwrote:

In src/drake/fs.clj:

@@ -134,7 +136,7 @@

(defn- hdfs-file-info [status]
{:path (remove-hdfs-prefix (.toString (.getPath status)))

  • :mod-time (.getModificationTime status)
  • :mod-time (:last-modified status)

I take it it's not an accidental change as it pertains to HDFS. Did you
run HDFS-related regtests (regtest_fs.sh)?


Reply to this email directly or view it on GitHubhttps://github.com//pull/60/files#r3502082
.

(normalized-filename [_ path]
(join "/" (list "" (remove-extra-slashes path))))
(rm [_ path]
(let [bkt-key (s3-bucket-key path) ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use destructuring here, too, it will be shorter and sweeter :)

@aboytsov
Copy link
Contributor

Hello, Chris!

I have finished reviewing the code. As I said earlier, we are very excited to have you contribute to the Drake's source code. Drake's support of S3 is important to many people. I hope my comments make sense to you and you'd agree with most of them.

Thank you very much!

Yours,
Artem.

@ghost ghost assigned aboytsov Mar 23, 2013
@howech
Copy link
Contributor Author

howech commented Mar 25, 2013

I believe that I have made most of the requested changed. I was not entirely sure of the extent that you wanted me to use destructuring in a couple of places, but I made an attempt.

Also, I took all of the tests out of test/drake/test/fs.clj and added a couple of files to resources/regtest that test S3 functionality in a way parallel to the HDFS tests.

@aboytsov
Copy link
Contributor

Hey there, Chris! Thanks a lot for making all these changes - I travel tomorrow but will try to get them reviewed by Wednesday.

@aboytsov
Copy link
Contributor

Hello, Chris!

I have reviewed all your changes - seems like there's a small bug in mv. Other than that most of my notes are trivial. The code has improved a lot in readability - great work! Just a bit more to go, and I'll be very happy to pull your contribution to our code base.

Thank you!

Artem.

@howech
Copy link
Contributor Author

howech commented Mar 29, 2013

I think that I have made all of the changes you have asked for, except for the 80-character limit thing...

@aboytsov
Copy link
Contributor

LGTM.
Great work! Pulling it in now.

@dirtyvagabond dirtyvagabond merged commit 5947777 into Factual:develop Apr 2, 2013
@dirtyvagabond
Copy link
Contributor

Pushed to master and released with Drake 0.1.3. Thanks @howech!

@howech, I created this simple wiki page...
https://github.com/Factual/drake/wiki/S3-Compatibility
... but I haven't had the chance to create a workflow demo. Any chance you'd be willing to add a small illustrative workflow onto the wiki page?

@dirtyvagabond
Copy link
Contributor

@howech BTW, we announced the latest version, including your addition, here:
https://groups.google.com/forum/#!topic/drake-workflow/EQGyMQIMbb4

(wasn't sure if you're a subscriber on that list)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants