
Docker Support #718

Merged
merged 23 commits into from Jul 30, 2019

Conversation


@yueguoguo yueguoguo commented Apr 8, 2019

Description

This PR initializes Docker support for the PySpark environment. It is intended for discussion with the team, to brainstorm and refine Docker support for the repo.

NOTE

  • I did not use the conda yaml file in the repo to build the conda env, because the base image from https://github.com/jupyter/docker-stacks handles the Jupyter kernel separately and has already installed many of the packages that exist in our yaml file.
  • To keep the image lightweight, the conda/pip packages that are duplicated between the base image and our yaml file have been removed.

TODO

  • Create pre-built image and publish in Docker hub
  • Test running the Docker container - I see this as a good example. Maybe we want to adopt it?
  • Finish the remaining Docker images, i.e., CPU and GPU

HOW-TO
A sample image has been created and published on my own Docker Hub account (we can and should create one for the team later on).
In a Linux terminal or Windows PowerShell (assuming Docker is pre-installed on the machine):

docker pull yueguoguo/reco_pyspark:latest
docker run --rm -p 8888 yueguoguo/reco_pyspark

Open a browser and go to localhost:8888 with the token generated in the above run of the image.
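
If the container is run detached, the token is not printed interactively; a hedged variant of the run command above, using an illustrative container name (`reco`) and a fixed host port, would be:

```{shell}
# map container port 8888 to the same host port and name the container for convenience
docker run -d --rm -p 8888:8888 --name reco yueguoguo/reco_pyspark:latest
# the Jupyter startup log contains the tokenized URL
docker logs reco 2>&1 | grep token
```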

UPDATE
2019-04-08

  • The pre-built image refers to a branch of the repo that contains only the notebooks executable in that environment. For example, in the PySpark environment the deep learning notebooks, which are supposed to run in a GPU environment, are removed, because the Python packages they require are not installed, in order to keep the image lightweight.

2019-06-21

  • "one to bind all". The same Dockerfile can be used for building an image with different environment, i.e., CPU, GPU, and Spark. This can be done by using the specific build args, i.e., cpu, gpu, and pyspark, respectively
  • SETUP.md is updated accordingly
  • Master branch of the repo will be cloned
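
A minimal sketch of how such a build might be invoked, assuming the build argument is named `ENV` (the actual argument name is defined in the Dockerfile):

```{shell}
# hypothetical invocation: select the environment via a build arg
docker build -t recommenders:pyspark --build-arg ENV=pyspark .
docker build -t recommenders:gpu --build-arg ENV=gpu .
```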

Related Issues

Discussed in #687

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

@yueguoguo added the "needs discussion" label Apr 8, 2019

@miguelgfierro miguelgfierro left a comment


this is great Le!

SETUP.md Outdated

Assuming Docker is pre-installed and configured (**NOTE** `docker` command is already available in the Azure Data Science Virtual Machine. For installation of Docker on a Linux machine, check instructions [here](https://docs.docker.com/install/)), for example, a PySpark Docker image can be built by the following command line
> ```{shell}
> cd Recommenders/reco_utils/docker/pyspark
Collaborator


I've been looking at where other libs put the docker files; both TF and PyTorch put them under the tools folder.

Maybe we should consider renaming scripts to tools, this was discussed several times @bethz

Collaborator Author


+1 on renaming

Collaborator


this could just be
docker build -t <image_name>/<tag> -f reco_utils/docker/pyspark
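
For completeness, `docker build` also needs a build context and `-f` expects the path of the Dockerfile itself, so the complete form of this one-liner would look roughly like the following (image name and tag are placeholders):

```{shell}
docker build -t <image_name>:<tag> -f reco_utils/docker/pyspark/Dockerfile .
```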

Collaborator


+1 on the name tools

Collaborator


@yueguoguo is the plan to finally create the tools folder and add the docker there?

Collaborator


@gramhagen what is the final decision for the location of docker?

Collaborator


What do people think about renaming scripts -> tools
then the structure for this would be

- tools/
    - docker/
        - Dockerfile
    - repo_metrics/
    - databricks_install.py
    - generate_conda_file.py

Collaborator Author


I think this looks good. Alternatively, we may want to put docker at the top-level? I see many other repos do this.

Collaborator


fair, I've seen this quite a bit too. ok. I'm leaning towards keeping this in docker folder at the top level, that's not too cluttered.

we can rename scripts in a separate PR.

@yueguoguo changed the title from "Docker PySpark" to "Docker PySpark - DISCUSSION AND IDEAS NEEDED" Apr 8, 2019

@gramhagen gramhagen left a comment


pretty cool, I just have a few questions

@loomlike

@yueguoguo @miguelgfierro What's the status of this?

Also wondering if we want to have non-PySpark (GPU or CPU) Docker support. I'd like to have that for two reasons:

  1. No hassle installing the env (I heard one of our Garage interns had some issues installing the env and running notebooks).
  2. Make it easy to use container orchestration solutions or even AzureML

@miguelgfierro

I guess we can have 3 versions (cpu, gpu, spark), not sure if we want to also have the full environment

@yueguoguo

yueguoguo commented Jun 12, 2019

@loomlike I am revisiting this part these days.

Quick ones re your ask:

No hassle to install env (I heard one of our Garage intern says having some issues of installing env and running notebooks).

The Docker support in this PR is based on the Jupyter Dockerfile, so it is for launching a Jupyter notebook server on whatever platform Docker runs on, in order to run the notebooks in our repo without upfront environment setup. So this scenario will be a nice fit.

If it is urgently needed, you can use the image I have built previously. E.g., the PySpark one can be found here. Instructions for use can be found in the README of the branch in this PR.

Make it easy to use container orchestration solutions or even AzureML

This needs modification to better support that scenario.

@gramhagen

Another thought here: it might actually be preferable to avoid images that have libraries pre-installed, as that will introduce potential differences from our existing setup.

@yueguoguo

Currently, using the setup instructions directly in the Docker image build yields an image with a size of 22GB...

@miguelgfierro

Currently, using the setup instruction directly in the docker image build yields an image with size of 22GB...

22? that's crazy. Do you know what is making the image so big?

@gramhagen

I was looking into building a docker image for the reco_base env, and then thought I would build off what you have here to try working from a minimal ubuntu image. The final spark image is only 6.5GB, which seems reasonable. I did have to make some changes to the spark test setup though (something bad happens if you allocate too little memory in docker).

I put some of the work I've been doing here: https://github.com/microsoft/recommenders/tree/gramhagen/spark_test_fix

there's a tools/docker folder with two images. Maybe we can consolidate these two branches?

@yueguoguo

@gramhagen FYI I updated the Dockerfile to reduce the package dependencies in the built image. It also allows users to specify an argument value for building the image for a different environment.

Sure let's see how to merge the work together.

@gramhagen

have you tried to run this on a dsvm? I just built the image and ran into a few problems. It looks like jupyter is set up to require a token, but it is also listening for traffic on 127.0.0.1, which will cause problems if you're running on a vm. I think the jupyter ip address should be set to 0.0.0.0.

also, it might be easier to just disable the token based authentication for jupyter?

I tried to run some of the tests and they failed because they could not find the reco_utils package, maybe this needs to be added to the pythonpath?

on the plus side the image size is down to ~6GB, so close to what I was seeing with the image built from scratch. I'm still a bit concerned about introducing so many discrepancies between the image and the default installation based on the full conda install. Even the version of python used is different (3.7 vs 3.6).

Other than speed of install do you think there's an advantage to going with this approach vs. clean install off ubuntu image?
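
A hedged sketch of the kind of workaround being discussed, overriding the Jupyter defaults at run time rather than in the image (the repo path inside the container and the image name are assumptions):

```{shell}
# bind Jupyter to all interfaces, disable token auth, and expose the cloned repo on PYTHONPATH
docker run --rm -p 8888:8888 \
    -e PYTHONPATH=/home/jovyan/Recommenders \
    yueguoguo/reco_pyspark \
    jupyter notebook --ip=0.0.0.0 --NotebookApp.token=''
```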

@yueguoguo

have you tried to run this on a dsvm? I just built the image and ran into a few problems. It looks like jupyter is setup to require a token, but it is also listening to traffic on 127.0.0.1, which will cause problems if you're running on a vm. I think the jupyter ip address should be set to 0.0.0.0.

I tried running it on my local laptop, and yea it listens on 127.0.0.1. When it is run, the token is generated and embedded directly in the URL. What problem did you get?

also, it might be easier to just disable the token based authentication for jupyter?

Yea I think it is doable. This is the default setting in the original base image.

I tried to run some of the tests and they failed because they could not find the reco_utils package, maybe this needs to be added to the pythonpath?

This is strange. Everything should be the same as cloning into a local environment...

on the plus side the image size is down to ~6GB, so close to what I was seeing with the image built from scratch. I'm still a bit concerned about introducing so many discrepancies btw the image and the default installation based on full conda install. even the version of python used is different (3.7 vs 3.6). Other than speed of install do you think there's an advantage to going with this approach vs. clean install off ubuntu image?

Ah, just realized that the Python version is 3.7. They just updated the Python version in the base image... Probably a tag pointing to the right version needs to be used. IMHO, the current base image is for the convenience of using Jupyter notebooks, which are the major assets in our repository. We can definitely do everything from scratch from an ubuntu base image, but maybe that would be reinventing the wheel... What do you think?

@gramhagen

I wasn't able to get to the jupyter server on the container when I ran it remotely; I created an ssh tunnel forwarding port 8888 to my machine. I suspect using 0.0.0.0 would work better. I can't remember if you see output from the container if you run it in detached mode? If not, then you won't be able to see the token. I'm not sure we're worried too much about security inside the container?

I think it's worth building from scratch, as we can lock down exactly what's used and we can leverage the same setup as is done locally. I don't think there's a whole lot going on that would be reinvented. Basically all that's needed is some apt installs, installing spark and anaconda, then grabbing our repo, setting up an environment, and adjusting the jupyter config settings. The only disadvantage is it's slower to build the image.

@loomlike

@yueguoguo Can we have another ARG to be able to specify a specific branch of the recommenders repo?

An example use case: I'm working on something on my branch, and I want to use this Dockerfile to create a docker image containing all of my changes (in my branch); see the sketch below.
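
A minimal sketch of what this could look like, assuming a hypothetical build argument named `BRANCH` that the Dockerfile passes to its git clone step:

```{shell}
# hypothetical: build an image from a developer's own branch instead of master
docker build -t recommenders:my-feature --build-arg BRANCH=my-feature-branch .
```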

@yueguoguo

@gramhagen

I wasn't able to get to the jupyter server on the container when I ran it remotely, I created an ssh tunnel fwding port 8888 to my machine. I suspect using 0.0.0.0 would work better. I can't remember if you see output from the container if you run it in detached mode? If not then you won't be able to see the token. I'm not sure we're worried too much about security inside the container?

Ah, I see. I have not tried forwarding a container run on a remote machine (say a VM) and tunneling it to a local one. I ran it on the local machine directly with docker run --rm -p 8888 yueguoguo/reco_pyspark and it generated the output, and thus the token. I believe running in detached mode will not give the interactive output. The token and security come from the base image, and sure, if there is no particular concern about it we can remove that.

I think it's worth building from scratch as we can lock down exactly what's used and we can leverage the same setup as is done locally. I don't think there's a whole lot going on that would be reinvented. Basically all that's needed is some apt-installs, install spark, anaconda, then grab our repo setup an environment and adjust jupyter config settings. The only disadvantage is it's slower to build the image.

Yea, makes sense. Using the same approach to set up the whole environment as we do in the repo generated a super big image (like what I tried previously) that needs to be optimized.

@yueguoguo

@yueguoguo Can we have another ARG to be able to specify a specific branch of the recommenders repo? Example use case will be -- If I'm working on something on my branch, and I want to use this Dockerfile to create a docker image containing all of my changes (in my branch)

Yea, sure. Actually it was included in the last PR but I removed it, as the idea is to provide the user (of the Docker image) with a stable version. It certainly makes sense to have the variable to allow developers to customize an image for their own use.

@gramhagen

@yueguoguo here's a gist for 3 images; they work fine from both the notebook and pytest perspectives (running locally or on a remote vm). The final images are ~5-8GB. Thoughts on what we should incorporate here?

https://gist.github.com/gramhagen/c572d9677ccfb43a323f7e42b071c059

@yueguoguo

@gramhagen It looks great. Just one general question: do we want to have just one Dockerfile, probably with multiple stages in it, such that the user can build the desired environment by passing the corresponding argument? There is a lot in common across the three Dockerfiles. What do you think?

@gramhagen

@yueguoguo yes I think that's a good idea, I can try using your conditional logic to consolidate them. Would be nice to only have one to maintain.
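
One way such a consolidation could be invoked, sketched here with assumed stage names, is a single multi-stage Dockerfile where each environment is selected with `--target` (or, as above, with a build arg):

```{shell}
# hypothetical: each environment as a named stage in one Dockerfile
docker build -t recommenders:cpu --target cpu .
docker build -t recommenders:gpu --target gpu .
docker build -t recommenders:pyspark --target pyspark .
```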


@gramhagen gramhagen left a comment


oops, looks like I need to resolve changes to setup.md vs docker/readme.md

@gramhagen removed the "needs discussion" label Jul 29, 2019
@gramhagen changed the title from "Docker PySpark - DISCUSSION AND IDEAS NEEDED" to "Docker Support" Jul 29, 2019
@yueguoguo

@gramhagen FYI I added a note about exporting the DOCKER_BUILDKIT environment variable to make sure the multi-stage build runs well.

In general, I think the PR is good to go, and maybe we want to merge it into master?
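
For reference, BuildKit (which can skip stages that the selected target does not need) is enabled through that environment variable; a hedged example of how the note might be applied (the build argument shown is illustrative):

```{shell}
# Linux / macOS
export DOCKER_BUILDKIT=1
docker build -t recommenders:pyspark --build-arg ENV=pyspark .

# Windows PowerShell equivalent:
# $env:DOCKER_BUILDKIT = 1
```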

@gramhagen gramhagen merged commit 83e0d69 into staging Jul 30, 2019
@gramhagen

Ok, cool. Merged this to staging, will create a PR for merging to master.

@gramhagen gramhagen mentioned this pull request Jul 30, 2019
@miguelgfierro miguelgfierro deleted the le_docker branch July 30, 2019 14:25
gramhagen added a commit that referenced this pull request Jul 31, 2019
* new file with wikidata functions

* fix in json extraction

* new notebook with wikidata use examples

* retry request with lowercase in case of failure

* WIP: example creating KG from movielens entities

* introduced new step to retrieve first page title from a text query in wikipedia

* updated movielens links extraction using wikidata

* adapted docstrings for sphinx and removed parenthesis from output

* added description and labels to nodes to graph preview

* #778 (comment) new format for queries

* raising exceptions in requests and using get() to retrieve dict values

* moved imports to first cell and movielens size as a parameter

* output file name as paramenter

* DATA: update sum check

* adding unit test for sum to 1 issue

* improved description and adapted to tests

* improved Exception descriptions

* integration tests

* unit tests

* added wikidata_KG to conftest

* changed name notebook

* *NOTE: Adding  shows the computation time of all tests.*

* imports up

* Update wikidata.py

* changed default parameter of sample for tests

* Add sphinx documentation for wikidata

* modified parameter extraction for tests

* added parameters tag to cell

* changed default sampling to test parameters in test

* notebook cleaned cells output

* Docker Support (#718)

* DOCKER: add pyspark docker file

* DOCKER: remove unused line

* DOCKER: remove old file

* DOCKER: add SETUP text

* DOCKER: add azureml`

* DOCKER: udpate dockerfile

* DOCKER: use a branch of the repo

* SETUP: update setup

* DOCKER: update dockerfile

* DOC: update setup

* DOCKER: one that binds all

* SETUP: update docker use

* DOCKER: move to top level

* SETUP: use a different base name

* DOCKER: use the same keywords in the repo for environment arg

* SETUP: update environment variable names

* updating dockerfile to use multistage build and adding readme

* adding full stage

* fixing documentation

* adding info for running full env

* README: update notes for exporting environment on certain platform

* README: updated with example on Windows

* README: fix typo
@miguelgfierro miguelgfierro mentioned this pull request Aug 1, 2019
yueguoguo added a commit that referenced this pull request Sep 9, 2019