
Docker Support #718

Merged
merged 23 commits into from Jul 30, 2019

Conversation


@yueguoguo yueguoguo commented Apr 8, 2019

Description

This PR initializes Docker support for the PySpark environment. It is intended for discussion with the team, to brainstorm and refine Docker support for the repo.

NOTE

  • I did not use the conda yaml file in the repo to build the conda env, because the base image from https://github.com/jupyter/docker-stacks handles the Jupyter kernel separately and has already installed many of the packages that exist in our yaml file.
  • To keep the image lightweight, the conda/pip packages that are duplicated between the base image and our yaml file have been removed.

TODO

  • Create pre-built image and publish in Docker hub
  • Test running the Docker container - I see this as a good example. Maybe we want to adopt it?
  • Finish the remaining Docker images, i.e., CPU and GPU

HOW-TO
A sample image has been created and published on my own Docker Hub account (we can and should create one for the team later on).
In a Linux terminal or Windows PowerShell (assuming Docker is pre-installed on the machine):

docker pull yueguoguo/reco_pyspark:latest
docker run --rm -p 8888 yueguoguo/reco_pyspark

Open a browser and go to localhost:8888 with the token generated in the above run of the image.
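
If the container is run detached, the token is not printed interactively; a hedged variant of the run command above, using an illustrative container name (`reco`) and a fixed host port, would be:

```{shell}
# map container port 8888 to the same host port and name the container for convenience
docker run -d --rm -p 8888:8888 --name reco yueguoguo/reco_pyspark:latest
# the Jupyter startup log contains the tokenized URL
docker logs reco 2>&1 | grep token
```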

UPDATE
2019-04-08

  • The pre-built image refers to a branch of the repo that contains only the notebooks executable in that environment. For example, in the PySpark environment the deep learning notebooks, which are supposed to run in a GPU environment, are removed, because the Python packages they require are not installed, in order to keep the image lightweight.

2019-06-21

  • "one to bind all". The same Dockerfile can be used for building an image with different environment, i.e., CPU, GPU, and Spark. This can be done by using the specific build args, i.e., cpu, gpu, and pyspark, respectively
  • SETUP.md is updated accordingly
  • Master branch of the repo will be cloned
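
A minimal sketch of how such a build might be invoked, assuming the build argument is named `ENV` (the actual argument name is defined in the Dockerfile):

```{shell}
# hypothetical invocation: select the environment via a build arg
docker build -t recommenders:pyspark --build-arg ENV=pyspark .
docker build -t recommenders:gpu --build-arg ENV=gpu .
```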

Related Issues

Discussed in #687

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

@yueguoguo added the "needs discussion" label Apr 8, 2019

@miguelgfierro miguelgfierro left a comment


this is great Le!

SETUP.md Outdated

Assuming Docker is pre-installed and configured (**NOTE** `docker` command is already available in the Azure Data Science Virtual Machine. For installation of Docker on a Linux machine, check instructions [here](https://docs.docker.com/install/)), for example, a PySpark Docker image can be built by the following command line
> ```{shell}
> cd Recommenders/reco_utils/docker/pyspark
Collaborator


I've been looking at where other libs put the docker files; both TF and PyTorch put them under the tools folder.

Maybe we should consider renaming scripts to tools, this was discussed several times @bethz

Collaborator Author


+1 on renaming

Collaborator


this could just be
docker build -t <image_name>/<tag> -f reco_utils/docker/pyspark
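
For completeness, `docker build` also needs a build context and `-f` expects the path of the Dockerfile itself, so the complete form of this one-liner would look roughly like the following (image name and tag are placeholders):

```{shell}
docker build -t <image_name>:<tag> -f reco_utils/docker/pyspark/Dockerfile .
```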

Collaborator


+1 on the name tools

Collaborator


@yueguoguo is the plan to finally create the tools folder and add the docker there?

Collaborator


@gramhagen what is the final decision for the location of docker?

Collaborator


What do people think about renaming scripts -> tools
then the structure for this would be

- tools/
    - docker/
        - Dockerfile
    - repo_metrics/
    - databricks_install.py
    - generate_conda_file.py

Collaborator Author


I think this looks good. Alternatively, we may want to put docker at the top-level? I see many other repos do this.

Collaborator


fair, I've seen this quite a bit too. ok. I'm leaning towards keeping this in docker folder at the top level, that's not too cluttered.

we can rename scripts in a separate PR.

@yueguoguo changed the title from "Docker PySpark" to "Docker PySpark - DISCUSSION AND IDEAS NEEDED" Apr 8, 2019

@gramhagen gramhagen left a comment


pretty cool, I just have a few questions

@loomlike

@yueguoguo @miguelgfierro What's the status of this?

Also wondering if we want to have non-PySpark (GPU or CPU) Docker support. I'd like to have that for two reasons:

  1. No hassle installing the env (I heard one of our Garage interns had some issues installing the env and running notebooks).
  2. Make it easy to use container orchestration solutions or even AzureML

@miguelgfierro

I guess we can have 3 versions (cpu, gpu, spark), not sure if we want to also have the full environment

@yueguoguo

yueguoguo commented Jun 12, 2019

@loomlike I am revisiting this part these days.

Quick ones re your ask:

No hassle to install env (I heard one of our Garage intern says having some issues of installing env and running notebooks).

The Docker support in this PR is based on the Jupyter Dockerfile, so it is for launching a Jupyter notebook server on whatever platform Docker runs on, in order to run the notebooks in our repo without upfront environment setup. So this scenario will be a nice fit.

If it is urgently needed, you can use the image I have built previously. E.g., the PySpark one can be found here. Instructions for use can be found in the README of the branch in this PR.

Make it easy to use container orchestration solutions or even AzureML

This needs modification to better support that scenario.

@gramhagen

Another thought here: it might actually be preferable to avoid images that have libraries pre-installed, as that will introduce potential differences from our existing setup.

@yueguoguo

Currently, using the setup instructions directly in the Docker image build yields an image with a size of 22GB...

@miguelgfierro

Currently, using the setup instruction directly in the docker image build yields an image with size of 22GB...

22? that's crazy. Do you know what is making the image so big?

@gramhagen

I was looking into building a docker image for the reco_base env, and then thought I would build off what you have here to try working from a minimal ubuntu image. The final spark image is only 6.5GB, which seems reasonable. I did have to make some changes to the spark test setup though (something bad happens if you allocate too little memory in docker).

I put some of the work I've been doing here: https://github.com/microsoft/recommenders/tree/gramhagen/spark_test_fix

there's a tools/docker folder with two images. Maybe we can consolidate these two branches?

@yueguoguo

@gramhagen FYI I updated the Dockerfile to reduce the package dependencies in the built image. It also allows users to specify an argument value for building the image for a different environment.

Sure let's see how to merge the work together.

@gramhagen

have you tried to run this on a dsvm? I just built the image and ran into a few problems. It looks like jupyter is set up to require a token, but it is also listening for traffic on 127.0.0.1, which will cause problems if you're running on a vm. I think the jupyter ip address should be set to 0.0.0.0.

also, it might be easier to just disable the token based authentication for jupyter?

I tried to run some of the tests and they failed because they could not find the reco_utils package, maybe this needs to be added to the pythonpath?

on the plus side the image size is down to ~6GB, so close to what I was seeing with the image built from scratch. I'm still a bit concerned about introducing so many discrepancies between the image and the default installation based on the full conda install. Even the version of python used is different (3.7 vs 3.6).

Other than speed of install do you think there's an advantage to going with this approach vs. clean install off ubuntu image?
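
A hedged sketch of the kind of workaround being discussed, overriding the Jupyter defaults at run time rather than in the image (the repo path inside the container and the image name are assumptions):

```{shell}
# bind Jupyter to all interfaces, disable token auth, and expose the cloned repo on PYTHONPATH
docker run --rm -p 8888:8888 \
    -e PYTHONPATH=/home/jovyan/Recommenders \
    yueguoguo/reco_pyspark \
    jupyter notebook --ip=0.0.0.0 --NotebookApp.token=''
```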

@yueguoguo

have you tried to run this on a dsvm? I just built the image and ran into a few problems. It looks like jupyter is setup to require a token, but it is also listening to traffic on 127.0.0.1, which will cause problems if you're running on a vm. I think the jupyter ip address should be set to 0.0.0.0.

I tried running it on my local laptop, and yea it listens on 127.0.0.1. When it is run, the token is generated and embedded directly in the URL. What problem did you get?

also, it might be easier to just disable the token based authentication for jupyter?

Yea I think it is doable. This is the default setting in the original base image.

I tried to run some of the tests and they failed because they could not find the reco_utils package, maybe this needs to be added to the pythonpath?

This is strange. Everything should be the same as cloning into a local environment...

on the plus side the image size is down to ~6GB, so close to what I was seeing with the image built from scratch. I'm still a bit concerned about introducing so many discrepancies btw the image and the default installation based on full conda install. even the version of python used is different (3.7 vs 3.6). Other than speed of install do you think there's an advantage to going with this approach vs. clean install off ubuntu image?

Ah, just realized that the Python version is 3.7. They just updated the Python version in the base image... Probably a tag pointing to the right version needs to be used. IMHO, the current base image is for the convenience of using Jupyter notebooks, which are the major assets in our repository. We can definitely do everything from scratch from an ubuntu base image, but maybe that would be reinventing the wheel... What do you think?

@gramhagen

I wasn't able to get to the jupyter server on the container when I ran it remotely; I created an ssh tunnel forwarding port 8888 to my machine. I suspect using 0.0.0.0 would work better. I can't remember if you see output from the container if you run it in detached mode? If not, then you won't be able to see the token. I'm not sure we're worried too much about security inside the container?

I think it's worth building from scratch, as we can lock down exactly what's used and we can leverage the same setup as is done locally. I don't think there's a whole lot going on that would be reinvented. Basically all that's needed is some apt installs, installing spark and anaconda, then grabbing our repo, setting up an environment, and adjusting the jupyter config settings. The only disadvantage is it's slower to build the image.

@loomlike

@yueguoguo Can we have another ARG to be able to specify a specific branch of the recommenders repo?

An example use case: I'm working on something on my branch, and I want to use this Dockerfile to create a docker image containing all of my changes (in my branch); see the sketch below.
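
A minimal sketch of what this could look like, assuming a hypothetical build argument named `BRANCH` that the Dockerfile passes to its git clone step:

```{shell}
# hypothetical: build an image from a developer's own branch instead of master
docker build -t recommenders:my-feature --build-arg BRANCH=my-feature-branch .
```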

@yueguoguo

@gramhagen

I wasn't able to get to the jupyter server on the container when I ran it remotely, I created an ssh tunnel fwding port 8888 to my machine. I suspect using 0.0.0.0 would work better. I can't remember if you see output from the container if you run it in detached mode? If not then you won't be able to see the token. I'm not sure we're worried too much about security inside the container?

Ah, I see. I have not tried forwarding a container run on a remote machine (say a VM) and tunneling it to a local one. I ran it on the local machine directly with docker run --rm -p 8888 yueguoguo/reco_pyspark and it generated the output, and thus the token. I believe running in detached mode will not give the interactive output. The token and security come from the base image, and sure, if there is no particular concern about it we can remove that.

I think it's worth building from scratch as we can lock down exactly what's used and we can leverage the same setup as is done locally. I don't think there's a whole lot going on that would be reinvented. Basically all that's needed is some apt-installs, install spark, anaconda, then grab our repo setup an environment and adjust jupyter config settings. The only disadvantage is it's slower to build the image.

Yea, makes sense. Using the same approach to set up the whole environment as we do in the repo generated a super big image (like what I tried previously) that needs to be optimized.

@yueguoguo

@yueguoguo Can we have another ARG to be able to specify a specific branch of the recommenders repo? Example use case will be -- If I'm working on something on my branch, and I want to use this Dockerfile to create a docker image containing all of my changes (in my branch)

Yea, sure. Actually it was included in the last PR but I removed it, as the idea is to provide the user (of the Docker image) with a stable version. It certainly makes sense to have the variable to allow developers to customize an image for their own use.

@gramhagen

@yueguoguo here's a gist for 3 images; they work fine from both the notebook and pytest perspectives (running locally or on a remote vm). The final images are ~5-8GB. Thoughts on what we should incorporate here?

https://gist.github.com/gramhagen/c572d9677ccfb43a323f7e42b071c059

@yueguoguo

@gramhagen It looks great. Just one general question: do we want to have just one Dockerfile, probably with multiple stages in it, such that the user can build the desired environment by passing the corresponding argument? There is a lot in common across the three Dockerfiles. What do you think?

@gramhagen

@yueguoguo yes I think that's a good idea, I can try using your conditional logic to consolidate them. Would be nice to only have one to maintain.
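
One way such a consolidation could be invoked, sketched here with assumed stage names, is a single multi-stage Dockerfile where each environment is selected with `--target` (or, as above, with a build arg):

```{shell}
# hypothetical: each environment as a named stage in one Dockerfile
docker build -t recommenders:cpu --target cpu .
docker build -t recommenders:gpu --target gpu .
docker build -t recommenders:pyspark --target pyspark .
```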


@gramhagen gramhagen left a comment


oops, looks like I need to resolve changes to setup.md vs docker/readme.md

@gramhagen removed the "needs discussion" label Jul 29, 2019
@gramhagen changed the title from "Docker PySpark - DISCUSSION AND IDEAS NEEDED" to "Docker Support" Jul 29, 2019
@yueguoguo

@gramhagen FYI I added a note about exporting the DOCKER_BUILDKIT environment variable to make sure the multi-stage build runs well.

In general, I think the PR is good to go, and maybe we want to merge it into master?
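
For reference, BuildKit (which can skip stages that the selected target does not need) is enabled through that environment variable; a hedged example of how the note might be applied (the build argument shown is illustrative):

```{shell}
# Linux / macOS
export DOCKER_BUILDKIT=1
docker build -t recommenders:pyspark --build-arg ENV=pyspark .

# Windows PowerShell equivalent:
# $env:DOCKER_BUILDKIT = 1
```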

@gramhagen gramhagen merged commit 83e0d69 into staging Jul 30, 2019
@gramhagen

Ok, cool. Merged this to staging, will create a PR for merging to master.

@gramhagen gramhagen mentioned this pull request Jul 30, 2019
@miguelgfierro miguelgfierro deleted the le_docker branch July 30, 2019 14:25
gramhagen added a commit that referenced this pull request Jul 31, 2019
* new file with wikidata functions

* fix in json extraction

* new notebook with wikidata use examples

* retry request with lowercase in case of failure

* WIP: example creating KG from movielens entities

* introduced new step to retrieve first page title from a text query in wikipedia

* updated movielens links extraction using wikidata

* adapted docstrings for sphinx and removed parenthesis from output

* added description and labels to nodes to graph preview

* #778 (comment) new format for queries

* raising exceptions in requests and using get() to retrieve dict values

* moved imports to first cell and movielens size as a parameter

* output file name as paramenter

* DATA: update sum check

* adding unit test for sum to 1 issue

* improved description and adapted to tests

* improved Exception descriptions

* integration tests

* unit tests

* added wikidata_KG to conftest

* changed name notebook

* *NOTE: Adding  shows the computation time of all tests.*

* imports up

* Update wikidata.py

* changed default parameter of sample for tests

* Add sphinx documentation for wikidata

* modified parameter extraction for tests

* added parameters tag to cell

* changed default sampling to test parameters in test

* notebook cleaned cells output

* Docker Support (#718)

* DOCKER: add pyspark docker file

* DOCKER: remove unused line

* DOCKER: remove old file

* DOCKER: add SETUP text

* DOCKER: add azureml`

* DOCKER: udpate dockerfile

* DOCKER: use a branch of the repo

* SETUP: update setup

* DOCKER: update dockerfile

* DOC: update setup

* DOCKER: one that binds all

* SETUP: update docker use

* DOCKER: move to top level

* SETUP: use a different base name

* DOCKER: use the same keywords in the repo for environment arg

* SETUP: update environment variable names

* updating dockerfile to use multistage build and adding readme

* adding full stage

* fixing documentation

* adding info for running full env

* README: update notes for exporting environment on certain platform

* README: updated with example on Windows

* README: fix typo
@miguelgfierro miguelgfierro mentioned this pull request Aug 1, 2019
yueguoguo added a commit that referenced this pull request Sep 9, 2019