Docker Support #718
Conversation
this is great Le!
SETUP.md (Outdated)

> Assuming Docker is pre-installed and configured (**NOTE**: the `docker` command is already available in the Azure Data Science Virtual Machine; for installation of Docker on a Linux machine, check the instructions [here](https://docs.docker.com/install/)), a PySpark Docker image can, for example, be built by the following command line:
> ```{shell}
> cd Recommenders/reco_utils/docker/pyspark
> ```
+1 on renaming
this could just be
```{shell}
docker build -t <image_name>/<tag> -f reco_utils/docker/pyspark
```
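As an aside for readers: `docker build` also requires a build context in addition to the `-f` Dockerfile path, so a complete version of the suggested command would look roughly like this (the image name and tag are placeholders):

```{shell}
# -f points at the Dockerfile itself; the trailing "." is the build
# context (here, the repository root), which docker build requires
docker build -t <image_name>:<tag> -f reco_utils/docker/pyspark/Dockerfile .
```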
+1 on the name tools
@yueguoguo is the plan to finally create the `tools` folder and add the docker there?
@gramhagen what is the final decision for the location of docker?
What do people think about renaming `scripts` -> `tools`? Then the structure for this would be:
- tools/
  - docker/
    - Dockerfile
  - repo_metrics/
  - databricks_install.py
  - generate_conda_file.py
I think this looks good. Alternatively, we may want to put `docker` at the top level? I see many other repos do this.
Fair, I've seen this quite a bit too. OK, I'm leaning towards keeping this in a `docker` folder at the top level; that's not too cluttered. We can rename `scripts` in a separate PR.
pretty cool, I just have a few questions
@yueguoguo @miguelgfierro What's the status of this? Also wondering if we want to have non-pyspark (gpu or cpu) Docker support. I'd like to have that for two reasons:
I guess we can have 3 versions (cpu, gpu, spark); not sure if we want to also have the full environment.
@loomlike I am revisiting this part these days. Quick ones re your ask:
The Docker support in this PR is based on the Jupyter Dockerfile, so it is for launching a Jupyter notebook on whatever desired platform where Docker is run, to run the notebooks in our repo without upfront environment setup. So this scenario will be a nice fit. If it is urgently needed, you can use the image I have built previously; e.g., the PySpark one can be found here. Instructions for use can be found in the README of the branch in this PR.
It needs modification to better support this scenario.
Another thought here: it might actually be preferable to avoid images that have libraries pre-installed, as that will introduce potential differences from our existing setup.
Currently, using the setup instructions directly in the Docker image build yields an image with a size of 22 GB...
22 GB? That's crazy. Do you know what is making the image so big?
I was looking into building a Docker image for the reco_base env, and then thought I would build off what you have here to try out working from a minimal ubuntu image. The final Spark image is only 6.5 GB, which seems reasonable. I did have to make some changes to the Spark test setup though (something bad happens if you allocate too little memory in Docker). I put some of the work I've been doing here: https://github.com/microsoft/recommenders/tree/gramhagen/spark_test_fix; there's a tools/docker folder with two images. Maybe we can consolidate these two branches?
@gramhagen FYI I updated the Dockerfile to reduce the dependencies of the packages in the built image. It also allows users to specify an argument value for building the image with a different environment. Sure, let's see how to merge the work together.
have you tried to run this on a DSVM? I just built the image and ran into a few problems. It looks like Jupyter is set up to require a token, but it is also listening on 127.0.0.1, which will cause problems if you're running on a VM. I think the Jupyter IP address should be set to 0.0.0.0. Also, it might be easier to just disable the token-based authentication for Jupyter? I tried to run some of the tests and they failed because they could not find the `reco_utils` package; maybe this needs to be added to the PYTHONPATH? On the plus side, the image size is down to ~6 GB, so close to what I was seeing with the image built from scratch. I'm still a bit concerned about introducing so many discrepancies between the image and the default installation based on the full conda install; even the version of Python used is different (3.7 vs 3.6). Other than speed of install, do you think there's an advantage to going with this approach vs. a clean install off the ubuntu image?
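The permissive Jupyter launch being suggested here would look something like the following; the flags are standard Jupyter notebook options, and disabling the token is only reasonable for a throwaway dev container:

```{shell}
# Listen on all interfaces so the server is reachable from outside the
# container/VM; disable token auth for convenience (dev use only)
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root \
    --NotebookApp.token=''
```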
I tried running it on my local laptop, and yeah, it listens on 127.0.0.1. When it is run, a token is generated that is embedded directly in the URL. What problem did you get?
Yeah, I think that is doable. This is the default setting in the original base image.
This is strange. Everything should be the same as cloning into a local environment...
Ah, just realized that the Python version is 3.7. They just updated the Python version in the base image... Probably a tag pointing to the right version needs to be used. IMHO, the current base image is for the convenience of using Jupyter notebooks, which are the major assets in our repository. We can definitely do everything from scratch from an ubuntu base image, but maybe that would be reinventing the wheel... What do you think?
I wasn't able to get to the Jupyter server on the container when I ran it remotely; I created an SSH tunnel forwarding port 8888 to my machine. I suspect using 0.0.0.0 would work better. I can't remember if you see output from the container if you run it in detached mode? If not, then you won't be able to see the token. I'm not sure we're worried too much about security inside the container. I think it's worth building from scratch, as we can lock down exactly what's used and we can leverage the same setup as is done locally. I don't think there's a whole lot going on that would be reinvented: basically all that's needed is some apt installs, installing Spark and Anaconda, then grabbing our repo, setting up an environment, and adjusting the Jupyter config settings. The only disadvantage is that it's slower to build the image.
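As a rough sketch of the from-scratch recipe described above (this is not the actual Dockerfile from the linked branch; the base image tag, the conda-file script path, and the environment file name are assumptions for illustration):

```{dockerfile}
FROM ubuntu:18.04

# apt installs: basics plus Java for Spark
RUN apt-get update && apt-get install -y --no-install-recommends \
        git wget bzip2 openjdk-8-jre-headless && \
    rm -rf /var/lib/apt/lists/*

# install Anaconda (Miniconda here, to keep the image small)
RUN wget -qO /tmp/miniconda.sh \
        https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash /tmp/miniconda.sh -b -p /opt/conda && rm /tmp/miniconda.sh
ENV PATH=/opt/conda/bin:$PATH

# grab the repo and set up the environment the same way as a local install
RUN git clone https://github.com/microsoft/recommenders /root/Recommenders && \
    cd /root/Recommenders && \
    python scripts/generate_conda_file.py --pyspark && \
    conda env update -f reco_pyspark.yaml

# make reco_utils importable and adjust Jupyter to listen externally
ENV PYTHONPATH=/root/Recommenders
WORKDIR /root/Recommenders
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root", "--no-browser"]
```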
@yueguoguo Can we have another ...? An example use case would be: if I'm working on something on my branch, and I want to use this Dockerfile to create a Docker image containing all of my changes (in my branch).
Ah, I see. I have not tried forwarding a container run on a remote machine (say a VM) and tunneling it to a local one. I ran it on the local machine directly with
Yeah, that makes sense. Using the same approach to set up the whole environment as we do in the repo generated a super big image (like what I tried previously) that needs to be optimized.
Yeah, sure. Actually, it was included in the last PR but I removed it, as the idea is to provide the user (of the Docker image) a stable version. It certainly makes sense to have the variable to allow developers to customize an image for their own use.
@yueguoguo here's a gist for 3 images; they work fine from both the notebook and pytest perspectives (running locally or on a remote VM). The final images are ~5-8 GB. Thoughts on what we should incorporate here? https://gist.github.com/gramhagen/c572d9677ccfb43a323f7e42b071c059
@gramhagen It looks great. Just one general question: do we want to have just one Dockerfile, probably with multiple stages in it, such that users can build a desirable environment by passing the corresponding argument? There is something in common across all three Dockerfiles. What do you think?
@yueguoguo yes, I think that's a good idea. I can try using your conditional logic to consolidate them. It would be nice to only have one to maintain.
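The consolidation being discussed here could lean on Docker's multi-stage builds plus a build argument that selects the final stage. The stage names and the `ENV` argument below are illustrative, not necessarily the names ultimately used in the repo:

```{dockerfile}
# select the final stage at build time, e.g.:
#   docker build --build-arg ENV=pyspark -t recommenders:pyspark .
ARG ENV=cpu

FROM ubuntu:18.04 AS base
# ... steps shared by all environments (apt installs, conda, repo clone) ...

FROM base AS cpu
# ... cpu-only python dependencies ...

FROM base AS gpu
# ... CUDA / gpu dependencies ...

FROM base AS pyspark
# ... Java + Spark dependencies ...

# the final image is whichever stage ENV names
FROM ${ENV} AS final
```

Equivalently, `docker build --target pyspark ...` selects a stage directly without needing the argument.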
oops, looks like I need to resolve changes to SETUP.md vs docker/README.md
@gramhagen FYI I added a note for exporting the environment variable of ... In general, I think the PR is good to go, and maybe we want to merge it into master?
Ok, cool. Merged this to staging, will create a PR for merging to master.
* new file with wikidata functions
* fix in json extraction
* new notebook with wikidata use examples
* retry request with lowercase in case of failure
* WIP: example creating KG from movielens entities
* introduced new step to retrieve first page title from a text query in wikipedia
* updated movielens links extraction using wikidata
* adapted docstrings for sphinx and removed parenthesis from output
* added description and labels to nodes to graph preview
* #778 (comment) new format for queries
* raising exceptions in requests and using get() to retrieve dict values
* moved imports to first cell and movielens size as a parameter
* output file name as paramenter
* DATA: update sum check
* adding unit test for sum to 1 issue
* improved description and adapted to tests
* improved Exception descriptions
* integration tests
* unit tests
* added wikidata_KG to conftest
* changed name notebook
* *NOTE: Adding shows the computation time of all tests.*
* imports up
* Update wikidata.py
* changed default parameter of sample for tests
* Add sphinx documentation for wikidata
* modified parameter extraction for tests
* added parameters tag to cell
* changed default sampling to test parameters in test
* notebook cleaned cells output
* Docker Support (#718)
  * DOCKER: add pyspark docker file
  * DOCKER: remove unused line
  * DOCKER: remove old file
  * DOCKER: add SETUP text
  * DOCKER: add azureml
  * DOCKER: udpate dockerfile
  * DOCKER: use a branch of the repo
  * SETUP: update setup
  * DOCKER: update dockerfile
  * DOC: update setup
  * DOCKER: one that binds all
  * SETUP: update docker use
  * DOCKER: move to top level
  * SETUP: use a different base name
  * DOCKER: use the same keywords in the repo for environment arg
  * SETUP: update environment variable names
  * updating dockerfile to use multistage build and adding readme
  * adding full stage
  * fixing documentation
  * adding info for running full env
  * README: update notes for exporting environment on certain platform
  * README: updated with example on Windows
  * README: fix typo
Description
Initialize a PR of Docker support for the PySpark environment. This PR is for discussion with the team to brainstorm and optimize Docker support for the repo.
NOTE
TODO
HOW-TO
A sample image is created and publicized on my own Docker Hub account (we can and should create one for the team later on).
In a Linux terminal or Windows PowerShell (assuming Docker is pre-installed on the machine):
Open a browser and go to localhost:8888 with the token generated in the above run of the image.
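For concreteness, the run step in this HOW-TO amounts to something like the following; the image name is a placeholder for the hub image mentioned above, not its actual name:

```{shell}
# map the container's Jupyter port to the host
docker run -p 8888:8888 <dockerhub_account>/recommenders-pyspark

# if the container runs on a remote VM, tunnel the port to your laptop first:
#   ssh -L 8888:localhost:8888 <user>@<vm-address>
```

Then open `http://localhost:8888` with the token printed in the container log.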
UPDATE

2019-04-08

2019-06-21

`cpu`, `gpu`, and `pyspark`, respectively. `SETUP.md` is updated accordingly.

Related Issues
Discussed in #687
Checklist: