
Project Documentation Enhancement #32

Closed
rasbt opened this issue Nov 25, 2015 · 23 comments
@rasbt
Contributor

rasbt commented Nov 25, 2015

I was thinking that it may be worthwhile to set up a project documentation page outside of this GitHub repo -- for example, via Sphinx or MkDocs. This would have the advantage of letting us create & organize API documentation and tutorials/examples. I could set up something like http://rasbt.github.io/biopandas/ if you'd find it useful.

@rhiever
Contributor

rhiever commented Nov 25, 2015

What's the advantage over a standard README? How tough is it to maintain?


@rasbt
Contributor Author

rasbt commented Nov 25, 2015

Well, of course you can always put 'everything' into a README file as well, but depending on future additions, this README can become huge and user-unfriendly. I'd say it's the same reason why people don't build websites as one large HTML/text file ...
I think for larger projects, breaking the documentation down into logical sections (e.g., one document to list and describe version changes, one to document the API, and several for tutorials/examples) wouldn't hurt.
I think that a README file is still important, though; it should certainly contain the most important information about the project.

@rhiever
Contributor

rhiever commented Nov 25, 2015

Doesn't hurt to have the web page docs then. I don't think the project is
large enough to merit that yet, but we will probably get there soon.


@pronojitsaha
Contributor

Hi @rhiever & @rasbt, I am quite interested in and motivated by the possibilities and impact potential of TPOT. If possible, I would like to contribute, and I think starting with the project documentation would be a good fit, if you need the help. Looking forward to hearing from you.

Thanks.

@rasbt
Contributor Author

rasbt commented Nov 25, 2015

@rhiever Yes, I was also thinking more in terms of "in the long run." It would certainly help though to start early and document "as we go."

If we were to set up project documentation, we'd probably want to use a static site generator like Sphinx, MkDocs, or Jekyll. I think it's typical for Python projects to use Sphinx. It's really a neat tool, but it's also a pretty complex beast, and personally, I find the default themes really clunky and ugly. I think MkDocs would work just fine, and I don't see any disadvantage of using Markdown over the reStructuredText format.

Once it's set up, it's actually pretty easy to maintain:

  1. make a change in the markdown file(s)
  2. view the changes live via mkdocs serve
  3. build the HTML via mkdocs build --clean
  4. deploy the changes to GitHub Pages via mkdocs gh-deploy

That's basically it.

@rhiever
Contributor

rhiever commented Nov 25, 2015

I would be happy for you two to take the helm on establishing the project
docs. Once I get back on Monday, I'll be focusing on development again.


@rasbt
Contributor Author

rasbt commented Nov 25, 2015

@rhiever @pronojitsaha Alright, sounds like a plan. I suggested setting up the MkDocs framework with the API generator and such since I've done this for other projects already, but if @pronojitsaha wants to do it, that would be fine with me too. Just let me know so that we don't implement the same thing twice :).

@pronojitsaha
Contributor

@rhiever & @rasbt thanks.

@rasbt As you already have a similar framework in place, I don't believe it makes sense to reinvent the wheel! You could share the existing framework as a separate repository, and then we can decide on the structure and contribute individual pages as mutually agreed. Does that work for you?

@rasbt
Contributor Author

rasbt commented Dec 1, 2015

@rhiever @pronojitsaha Sorry for the late response, I took a few days off over the long Thanksgiving weekend. Unfortunately, I am in the midst of wrapping up a few research projects before I go on vacation in a few days, so I probably won't get to it before January. But setting up a basic framework via Sphinx or MkDocs should be pretty straightforward, I guess. The gplearn library is actually a nice, lean example: https://gplearn.readthedocs.org/en/latest/examples.html

I would suggest using the README file as a template; I think the goal of the documentation would be to have an "appealing" layout with convenient navigation for finding relevant information.
I think it will definitely pay off in the long run as the code base grows (regarding the API documentation), and likewise as the number of tutorials and examples grows.

Maybe I'd start with the following sections/pages (a possible mkdocs.yml layout is sketched after the list):

  • "Contributing" (basic GitHub instructions: filing issues, forking, and pull requests etc.)
  • "Version History" (keeping track of new features and changes over time)
  • "API documentation" i.e., auto-parsing the docstrings
  • "Installation"
  • "Tutorials" or "Examples"

@pronojitsaha
Contributor

@rasbt Hope you had a good Thanksgiving. Ok, I will look into it and set up the initial framework using MkDocs, which we can then work on together once you are available in January. Enjoy the vacation!

@rasbt
Contributor Author

rasbt commented Dec 1, 2015

@pronojitsaha Just got home and read your message; I thought: setting up the template literally just takes 10 minutes, let's do this ;). See pull request #35.

I basically just pasted the sections from the README file for now; you can see it live at
http://rasbt.github.io/tpot/

(If you fetch or merge it, you can view it locally by running mkdocs serve from the docs/source directory -- by default it's served at http://127.0.0.1:8000/.)

So, I guess I'll take a look at the API documentation in January then, but I wanted to set this up so that you guys can maybe write the rest of the documentation and come up with some more examples and tutorials in the meantime.

@pronojitsaha
Contributor

@rasbt Ok, great! Will delve into it further.

@rhiever
Contributor

rhiever commented Dec 2, 2015

Thanks for the great start on these docs. I've merged #35.

@rhiever
Contributor

rhiever commented Dec 2, 2015

@rasbt, I've been updating the docs for the new export functionality and it takes double the work to update both the README and the docs. Any recommendations to avoid this duplication of labor?

@pronojitsaha, now that we have docs up and running, I can think of a couple things that would be invaluable at this point:

  1. Not all of the public TPOT functions are thoroughly documented. fit, score, and export in particular need more documentation, since those are the primary functions that people will be using. Currently we have a basic example of using them in the README, but it'd be great to expand on those docs and go into detail on what each function -- and each parameter of each function -- does (see the docstring sketch after this list).

  2. More examples are always welcome! Currently we only have the MNIST example from sklearn, but it'd be great to provide code examples for many different types of data sets.
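
For example, a NumPy-style docstring for fit could look roughly like this -- the parameter names here are a sketch, not necessarily the actual signature:

    def fit(self, features, classes):
        """Fit an optimized machine learning pipeline to the training data.

        Parameters
        ----------
        features : array-like {n_samples, n_features}
            Feature matrix of the training set
        classes : array-like {n_samples}
            Class labels of the training set

        Returns
        -------
        None
        """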

@rasbt
Contributor Author

rasbt commented Dec 2, 2015

@rhiever I'd recommend not cramming too much into the README file but focusing on the "essentials": an overview, a quick example, installation, license info, and short contributing notes. I would insert an "Important links" section at the top pointing to the actual documentation.
Otherwise, I'd suggest just assembling the README.md from the docs pages, e.g.,

cat index.md installation.md contributing.md MNIST_Example.md ... > README.md
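
Or, if a plain cat ever feels too fragile, a small Python sketch of the same idea -- the page list and docs directory below are placeholders, adjust to the actual layout:

    # assemble README.md from selected docs pages
    pages = ["index.md", "installation.md", "contributing.md", "MNIST_Example.md"]

    with open("README.md", "w") as readme:
        for page in pages:
            with open("docs/sources/" + page) as source:
                readme.write(source.read() + "\n")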

@pronojitsaha
Contributor

@rhiever Ok, I will look into the two points. I understand that at this point we have only implemented classification tasks, so the following are a few data sets I have in mind for examples; please let me know your views:

  1. Iris Dataset
  2. Titanic Dataset
  3. Lending Club Data
  4. Facial Keypoint Detection
  5. Forest Cover Type Dataset

However, hardware is a challenge, as increasing data set sizes will slow down TPOT considerably and increase the time involved. This also applies to #41 for unit testing. As such, have you thought about using EC2 instances for this project, or any other alternative to account for this?

@UniqueFool

hardware is a challenge as increasing data set sizes will slow down TPOT considerably

FWIW, other Python-based GP projects tend to use OpenCL/PyOpenCL to make better use of dedicated CPU/GPU and FPGA resources. In fact, a number are even using CUDA (which is NVIDIA-specific).

@rhiever
Contributor

rhiever commented Dec 7, 2015

For now, I think we'll stick to smaller data sets (e.g., the sklearn MNIST subset) for the examples in the docs, i.e., examples that can be run end-to-end in less than 10 minutes. I wouldn't want to require the user to fire up an EC2 instance or hop on an HPCC to run a basic TPOT example.
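
Something along these lines, for instance, finishes within minutes on the sklearn digits subset -- a sketch only, the exact TPOT signatures may differ:

    from tpot import TPOT
    from sklearn.datasets import load_digits
    from sklearn.cross_validation import train_test_split

    # the sklearn digits set is a small MNIST-like subset
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, train_size=0.75)

    tpot = TPOT(generations=5)  # few generations keeps the demo fast
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))       # accuracy on the holdout set
    tpot.export('tpot_digits_pipeline.py')  # export the best pipeline as code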

However, for some use cases it may take several hours to run TPOT -- especially with large data sets -- and I think it would be a good idea to note that in the docs. Perhaps in an "Expectations for TPOT" section of the docs?

@UniqueFool

Note that OpenCL is just an abstraction mechanism, i.e., the underlying "kernels" (C-like code) will work on CPUs, GPUs, and FPGA hardware.
Wrappers like pyopencl even hide the nitty-gritty details and expose all this flexibility to scripting space, which means that a Python script can implement heavy algorithms as "kernels" that will automatically make use of dedicated hardware if available.
The only real issue is that OpenCL does not currently lend itself to clustering/distribution.
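
To make the kernel idea concrete, here is a minimal PyOpenCL sketch (element-wise vector addition); the same code runs on whatever OpenCL device is available, CPU or GPU:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()  # picks an available OpenCL device
    queue = cl.CommandQueue(ctx)

    a = np.random.rand(50000).astype(np.float32)
    b = np.random.rand(50000).astype(np.float32)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # the C-like "kernel": one work-item per vector element
    prg = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
        int gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)
    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)  # copy the result back to the host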

Since you mention MNIST: I suggest running a corresponding Google search; there are a number of examples where using the GPU instead of the CPU (via OpenCL/CUDA) provided a ~100x speedup on the MNIST dataset, e.g., see: http://corpocrat.com/2014/11/09/running-a-neural-network-in-gpu/ (note that this example also uses Python and sklearn)

http://www.cs.berkeley.edu/~demmel/cs267_Spr11/Lectures/CatanzaroIntroToGPUs.pdf

@pronojitsaha
Contributor

@rhiever Ok, I think it makes sense to work on small data sets/subsets for now and focus more on the implementation of the examples. Will look into it.

@bartleyn
Contributor

Anyone working on documenting the pipeline operators and public functions? I've made some significant headway on it, but want to make sure I'm not duplicating labor.

@pronojitsaha
Contributor

Hi @bartleyn, I am not working on those at the moment.

@rhiever
Contributor

rhiever commented Jan 18, 2016

PR #71 is related and still in review (will get to it soon, promise -- I'm back from vacation now), but otherwise I believe that's the only pending change to the docs.
