
Sort out models, data, and tools #142

Merged
17 commits merged into BVLC:dev on Feb 26, 2014

Conversation

shelhamer (Member)

  1. Re-arrange dirs for cleanliness.
  2. Move learned models and data to caffe-mug repo dropbox for now (and a suitable server later).
  3. Provide scripts to download models and data as needed.
  4. Ignore models and data in the main repo by default. This makes local experimentation convenient.
  5. Keep model definitions around.

Orchestrating updates between commits in caffe and uploaded models and data has overhead, but is worth the separation of concerns and keeping the repo lean.
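The plan above might be realized with a small per-model fetch script; a minimal sketch, assuming placeholder names throughout — the directory, filename, and URL below are illustrative stand-ins, not the actual hosting locations:

```shell
#!/usr/bin/env bash
# Hedged sketch of a fetch script in the spirit of steps 3-4: download a
# learned model into a directory that git ignores by default.
# MODEL_DIR, MODEL_FILE, and MODEL_URL are placeholders.
set -e

MODEL_DIR="${MODEL_DIR:-/tmp/caffe_models_demo}"      # ignored dir (placeholder)
MODEL_FILE="reference_model.binaryproto"              # placeholder filename
MODEL_URL="https://example.org/models/${MODEL_FILE}"  # placeholder URL

mkdir -p "$MODEL_DIR"

# Fetch only if not already present, so repeated runs are cheap.
if [ ! -f "$MODEL_DIR/$MODEL_FILE" ]; then
  echo "would fetch: $MODEL_URL"
  # wget -O "$MODEL_DIR/$MODEL_FILE" "$MODEL_URL"  # the real download step
  : > "$MODEL_DIR/$MODEL_FILE"                     # stand-in for this sketch
fi

echo "model present: $MODEL_DIR/$MODEL_FILE"
```

A matching ignore rule for the model directory then keeps downloaded weights and local experiments out of version control, which is what makes step 4 convenient.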

sergeyk (Contributor) commented Feb 25, 2014

Decided to have a separate repository for data, model definition files, and example models, with scripts in this repo to download them as needed. The separate repo will be stable against master, not dev. @shelhamer will do this.

mavenlin (Contributor)

@sergeyk that will be nice, currently the repo is big and slow to download because of the synsets. Can these be removed from the git history also?

shelhamer (Member, Author)

@mavenlin the synsets are on my list for this reorganization. Filtering them from the history is necessary to save space, and a simple command, but it breaks history.

We'll consider such house cleaning when it comes time to release Caffe 1.0, but we're not going to rewrite history on master so casually now.

shelhamer (Member, Author)

@sergeyk I added you as a collaborator on my fork so that we can jointly take care of the documentation updates triggered by this PR. All you have to do is push to my fork's data-aux branch.

shelhamer (Member, Author)

Instead of scripts to pull models and data, a caffe-mug submodule for data and models was considered, but a submodule lacks choice: it's all or nothing.

shelhamer (Member, Author)

This isn't going to work with GitHub's (not unreasonable) file size and traffic limitations. Git's a drag with large binary files too, so perhaps it's for the best.

The alternative is self-hosting from campus or ICSI.

sergeyk (Contributor) commented Feb 25, 2014

Let's host as many models, sample data, and model def files as possible in a GitHub repo. For really large models, we can upload them to a publicly accessible ICSI place, and note the version (just the date, probably) in the filename.


shelhamer (Member, Author)

Oh, sorry I wasn't clear. This isn't going to work at all: not even the Caffe reference ImageNet model fits on its own, as there's a file size cap of 100 MB.

My fallback plan is ICSI hosting, with the fetch URLs versioned in master's scripts. We can have a simple script to publish the models, defs, and data into a directory named by timestamp.
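The publish side of this fallback could be sketched as follows; the staging and destination paths here are illustrative stand-ins, not the actual ICSI layout:

```shell
#!/usr/bin/env bash
# Hedged sketch: publish models, defs, and data into a directory named by
# timestamp, so the fetch scripts versioned in master can pin an exact
# snapshot. SRC and DEST are placeholders, not real server paths.
set -e

STAMP=$(date +%Y%m%d)                       # version by date
SRC="${SRC:-/tmp/caffe_publish_src}"        # staging dir (placeholder)
DEST="${DEST:-/tmp/caffe_publish}/$STAMP"   # would be a served directory

mkdir -p "$SRC" "$DEST"
cp -R "$SRC/." "$DEST/"

# Record the base URL that a fetch script committed to master would pin.
echo "https://example.edu/caffe/$STAMP/" > "$DEST/BASE_URL"
cat "$DEST/BASE_URL"
```

Because each snapshot lives under its own dated directory, updating master's fetch scripts to a new URL never breaks older checkouts that still point at an earlier snapshot.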

sergeyk (Contributor) commented Feb 25, 2014

I know that the reference ImageNet model won't fit. I still think that prototxt files and small sample data should be hosted on GitHub -- everything that fits under 100 MB, which is going to be basically everything except ImageNet models.

shelhamer (Member, Author)

Resolution: keep model definitions in the repo, drop included data, and add scripts to fetch learned models and data as needed. Auxiliary data and model weights will live on dropbox for the moment, and will find their permanent home on a Berkeley server after March 7. Our group will be bringing a demo server online after that date which can hold the data.

shelhamer (Member, Author)

This feels ready to me, modulo fixing the docs changes this triggers. @sergeyk, how about we update this once #155 is in?

kloudkl (Contributor) commented Feb 25, 2014

It seemed that having a demo was not a high priority (#78). May I ask what demo the server will host?

shelhamer (Member, Author)

Any suggestions on the dir structure or names are welcome; this is the time to arrange everything neatly.

@kloudkl re: demo, there will in fact be a Caffe demo along the lines of the DeCAF demo, and along with it other demos of our research group's projects. @Yangqing was not against a demo so much as spending too much time engineering a simple illustration of the framework and not focusing on the research hacking.

sguada (Contributor) commented Feb 25, 2014

What about this dir structure?

  • data/ // Contains one folder per dataset; each folder can have a script to get the data
    • mnist/
    • cifar10/
    • ilsvrc_2012/
    • ...
  • tools/ // Contains the main Caffe tools for training, testing, fine-tuning, net_speed, ...
    • train_net.cpp
    • test_net.cpp
    • fine_tuning.cpp
    • ...
  • models/ // Contains the different prototxt files defining the models
  • docs/ // Contains the documentation
  • examples/ // Could contain some samples of uses of Caffe, but shouldn't mix everything together

shelhamer (Member, Author)

That looks right, but I'm torn about how examples fit in. Packing example code, model, and data together makes the example clear, but makes reuse awkward. I'll package purely example files together, but keep data on its own.

Collect core Caffe tools like train_net, device_query, etc. together in tools/ and include helper scripts under tools/extra. Data, models, and examples should not be versioned by default; reference versions of these are not to be casually committed.

Plus this makes for a better playground in examples without having to worry about data, intermediate files, or experiments being accidentally tracked.

shelhamer (Member, Author)

Ok everyone, feast your eyes and let me know. Speak now or forever hold your peace.

@Yangqing @jeffdonahue @sergeyk @sguada @longjon

jeffdonahue (Contributor)

Looks great to me! Thanks for the reorganization work @shelhamer.

sguada (Contributor) commented Feb 26, 2014

It looks good, but I cannot compile it due to the hdf5 dependency introduced in #147, so I'm not sure whether it will work or not.

There are also some small errors in the get_data.sh scripts:
./get_ilsvrc_aux.sh: 9: ./get_ilsvrc_aux.sh: Bad substitution
./get_mnist.sh: 4: ./get_mnist.sh: Bad substitution
./get_cifar10.sh: 4: ./get_cifar10.sh: Bad substitution
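"Bad substitution" from sh usually means a bash-only parameter expansion (e.g. ${VAR/pat/repl} or ${VAR:0:n}) was run under a POSIX shell such as dash, which happens when the scripts are invoked as "sh get_mnist.sh" or when the system's /bin/sh is not bash. A minimal reproduction and a portable rewrite, with an illustrative variable name:

```shell
#!/bin/sh
# Runs under any POSIX sh; the bash-only form is shown commented out.
FILE="mnist.tar.gz"

# Bash-only pattern substitution; dash rejects it with "Bad substitution":
#   BASE=${FILE/.tar.gz/}

# POSIX suffix removal works in every sh:
BASE=${FILE%.tar.gz}
echo "$BASE"   # prints: mnist
```

Running the scripts explicitly with bash (bash get_mnist.sh) or giving them a #!/usr/bin/env bash shebang also sidesteps the issue.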

shelhamer (Member, Author)

Perhaps we should merge this to master instead of dev, like Sergey's #157, since this changes a lot and everything should be merged into the new arrangement and not the old.

dev and feature branches will need to be rebased, so we should incorporate this, Jeff's #163 and Eric's #152 into master then rebase dev all at once.

sguada (Contributor) commented Feb 26, 2014

That will solve the problems for now. We still need to figure out a way to either make hdf5 optional or to explain how to install it properly.

Sergio


sergeyk (Contributor) commented Feb 26, 2014

I'm for merging this one and #163 into master directly, but not #152, as it does not currently exist and is an involved code change.

sergeyk (Contributor) commented Feb 26, 2014

Looks good to me, we'll fix potential mistakes once we merge.

shelhamer (Member, Author)

@sguada there's no get_data.sh script? I tested the fetch scripts on OS X and Ubuntu.

@sergeyk: agreed on all counts. I'll merge soon.

sguada (Contributor) commented Feb 26, 2014

@shelhamer I meant all of the scripts: get_ilsvrc_aux.sh, get_mnist.sh, and so on. But maybe it is just me.

sguada (Contributor) commented Feb 26, 2014

@shelhamer besides that it's great; ready to merge.

shelhamer (Member, Author)

@sguada you might have some kind of unusual shell. Check the output of which sh, perhaps?

shelhamer added a commit that referenced this pull request Feb 26, 2014
Sort out models, data, examples, and tools
@shelhamer shelhamer merged commit d323547 into BVLC:dev Feb 26, 2014
@shelhamer shelhamer deleted the data-aux branch February 26, 2014 03:56
@shelhamer shelhamer mentioned this pull request Feb 26, 2014
shelhamer added a commit that referenced this pull request Feb 26, 2014
Sort out models, data, examples, and tools
shelhamer added a commit that referenced this pull request Feb 26, 2014
Sort out models, data, examples, and tools
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Sort out models, data, examples, and tools

6 participants