
Datascience tools container

This container was created to support various experiments in data science, mainly in the context of Kaggle competitions.

Bundled tools:

  • Based on Ubuntu 16.04
  • Python 3
  • Jupyter
  • TensorFlow (CPU and GPU flavors)
  • Spark driver (set SPARK_MASTER ENV pointing to your Spark Master)
  • Scoop, h5py, pandas, scikit, TFLearn, plotly
  • pyexcel-ods, pydicom, textblob, wavio, trueskill, cytoolz, ImageHash...

Run container:

  • CPU only:

    • create docker-compose.yml
    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /notebooks:/notebooks
        environment:
          - JUPYTER_TOKEN=flaviostutz
    
    • docker-compose up
  • GPU support for TensorFlow:

    • Prepare host machine with NVIDIA Cuda drivers
    • Install nvidia-docker and nvidia-docker-plugin
      • wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker_1.0.0-1_amd64.deb
      • sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
      • See https://github.com/NVIDIA/nvidia-docker for further installation details
    • nvidia-docker run -d -v /root:/notebooks -v /root/input:/notebooks/input -v /root/output:/notebooks/output -p 8888:8888 -p 6006:6006 --name jupyter flaviostutz/datascience-tools:latest-gpu
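    • To check that the NVIDIA runtime works before launching the GPU image, you can run nvidia-smi inside NVIDIA's CUDA test image (a quick sanity check; depending on your setup the nvidia/cuda image may need an explicit tag):

      # should print the same GPU table as nvidia-smi on the host
      nvidia-docker run --rm nvidia/cuda nvidia-smi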
  • If you wish this container to run automatically on host boot, add these lines to /etc/rc.local:

    • cd /root/datascience-tools/run && ./boot.sh >> /var/log/boot-script
    • Change "/root/datascience-tools" to where you cloned this repo (see the example /etc/rc.local sketched below)
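    • Example /etc/rc.local (a minimal sketch; the clone path is the same placeholder as above, adjust it to your setup):

      #!/bin/sh -e
      # start the datascience-tools container on every boot and keep a log
      cd /root/datascience-tools/run && ./boot.sh >> /var/log/boot-script 2>&1
      exit 0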

Access:

  • http://[ip]:8888 for Jupyter
  • http://[ip]:6006 for TensorBoard

Autorun script

  • When this container starts, it runs:
    • Jupyter Notebook server on port 8888
    • TensorBoard server on port 6006
    • A custom script located at /notebooks/autorun.sh
      • If autorun.sh doesn't exist, it is ignored
      • If it exists, it is run once every time you start/restart the container
      • Use this script for large batch jobs on servers that may shut down and reboot at random (as happens with AWS Spot Instances), so that the script can resume the previous work when the server comes back
      • Make sure your job saves partial results and can resume from them, so that no computation is wasted
      • On the host OS, run this Docker container with "--restart=always" so that it is started automatically at boot
      • It is possible to edit this file with Jupyter editor
      • Example script:
          #!/bin/bash
          python test.py
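      • A resume-aware sketch of autorun.sh (batch_job.py, its --resume flag and the checkpoint path are hypothetical placeholders for your own job):

          #!/bin/bash
          # resume a long batch job after an unexpected restart of the host/container
          cd /notebooks
          if [ -f output/checkpoint.pkl ]; then
            # a previous run left a checkpoint behind: continue from it
            python batch_job.py --resume output/checkpoint.pkl >> output/batch.log 2>&1
          else
            # first run: start from scratch
            python batch_job.py >> output/batch.log 2>&1
          fi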

Build instructions

  • docker build . -f Dockerfile -t flaviostutz/datascience-tools
  • docker build . -f Dockerfile-gpu -t flaviostutz/datascience-tools:latest-gpu

Tips for development of your own Notebooks

  • A good practice is to store your notebook scripts in a git repository

  • Run the datascience-tools container and map the container path "/notebooks" to the directory where you cloned your git repository on your computer (see the example docker-compose.yml below)

  • You can edit/save/run the scripts from the web interface (http://localhost:8888) or directly with other tools on your computer, and commit/push your code to the repository directly (no copying from/to the container is needed because the volume is mapped)

  • Example docker-compose.yml mapping a local clone to /notebooks:

    version: "3"
    services:
      datascience-tools:
        image: flaviostutz/datascience-tools
        ports:
          - 8888:8888
          - 6006:6006
        volumes:
          - /Users/flaviostutz/Documents/development/flaviostutz/puzzler/notebooks:/notebooks
  • For running in production, create a new image with "FROM flaviostutz/datascience-tools" and add your script files to "/notebooks", so that the container has your custom scripts embedded in it. No volume mapping is needed for such a container. During container startup, the script /notebooks/autorun.sh will run if present (a sketch is shown below).
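  • A minimal sketch of such a production image (the image name "my-notebooks-prod", the local "notebooks/" directory and "Dockerfile.custom" are placeholders):

    # write a two-line Dockerfile that embeds your local notebooks/ directory into the image
    printf 'FROM flaviostutz/datascience-tools\nCOPY notebooks/ /notebooks/\n' > Dockerfile.custom
    docker build . -f Dockerfile.custom -t my-notebooks-prod
    # --restart=always restarts the container (and autorun.sh) after host reboots
    docker run -d --restart=always -p 8888:8888 -p 6006:6006 my-notebooks-prod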

ENV variables

  • JUPYTER_TOKEN - token required for users to open Jupyter. Defaults to '', so that no token or password is asked of the user

  • SPARK_MASTER - Spark master address. Used if you want to send jobs to an external Spark cluster and still control the whole job from Jupyter Notebook itself.
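  • Example, passing both variables when starting the container (the Spark master address is illustrative):

    docker run -d -p 8888:8888 -p 6006:6006 \
      -v /notebooks:/notebooks \
      -e JUPYTER_TOKEN=flaviostutz \
      -e SPARK_MASTER=spark://my-spark-master:7077 \
      flaviostutz/datascience-tools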
