OpenRefine docker containers #933

Open
psychemedia opened this Issue Jan 17, 2015 · 14 comments

psychemedia commented Jan 17, 2015

Would it be useful to make some official releases of OpenRefine available via dockerhub? I've sketched a couple of Dockerfiles at:

that I think do the job, though there may be better ways of doing them?

In respect of official releases, it could be handy to have:

  • openrefine/openrefine-latest-build
  • openrefine/openrefine-latest-release
  • openrefine/openrefine-stable

An example of how to create a dockerfile that adds extensions to any of these base builds might also be useful for cut-and-paste tinkerers such as myself...
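E.g. something along these lines (a sketch only: the base image name, extension URL, and extensions path here are all hypothetical, not tested against a real build):

```dockerfile
# Sketch only: base image, extension URL, and paths are hypothetical.
FROM openrefine/openrefine-stable

# OpenRefine loads extensions unpacked into the webapp's extensions directory
ADD https://example.com/some-extension.zip /tmp/some-extension.zip
RUN unzip /tmp/some-extension.zip -d /app/webapp/extensions/ \
 && rm /tmp/some-extension.zip
```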

Ravenwater commented Jan 28, 2015

I like containers, so let me think this through with you. Given that OpenRefine is already a self-contained widget that minimally inserts itself into a machine, the only workflow I can see an OpenRefine container enabling is dispatching it in the cloud in a fast and reliable way. To do that properly, the Dockerfile would need to architect a storage hierarchy, so that there is some persistence and workspace to get non-trivial things done. If you are working with data sets that reside in a public/private cloud, it would be interesting to push the OpenRefine functionality to the data to reduce cost or increase performance, but that would only kick in for data sets over a couple of GB, which again makes the storage design more important.

What was the use case that you designed the dockerfile for?

psychemedia (Author) commented Jan 28, 2015

@Ravenwater The particular use case I have been exploring is the assembly of virtual machines for use in distance education by remote students. I am part of a team writing a distance education course on data management and data analysis for the UK Open University, and we are supplying students with a virtual machine that contains various databases (mongodb, postgresql), an analysis environment (IPython notebooks with a lot of preinstalled python packages), and OpenRefine.

The original model was to give students a single VM image configured using vagrant and docker. (As well as getting this particular machine built, I was also interested in ways in which we might support workflows for creating new VM configurations, as well as supporting VM configurations for courses that run once or twice a year for up to 5 years.) The particular use case we have at the moment is for students to download a VM image and run it using VirtualBox on their own computer; but I was mindful that there might also be a requirement to support students who want to access the VM running in the cloud or on institutional cloud servers. (Another model might be a traditional university where students use computer labs and need to access software packaged and maintained by the institution on public access machines.)

I also started exploring the idea of being able to create a VM assembled as a combination of docker containers. Part of the attraction of this is that an institution could maintain a set of "approved containers" that could be called on by people developing a new course requiring a new set-up.

The challenge then becomes one of adding data into the mix: e.g. a set of OpenRefine example projects accessible from an OpenRefine container, a database preconfigured with some example tables, or an IPython Notebook server pointing to a directory containing example notebooks.
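For example, a minimal sketch (the image name and host path are hypothetical) of mounting example projects into an OpenRefine container's workspace at run time rather than baking them into the image:

```shell
# Sketch only: image name and host path are hypothetical.
# Mount a host directory of example OpenRefine projects as the
# container's workspace directory:
docker run -d -p 3333:3333 \
    -v /path/to/example-projects:/mnt/refine \
    openrefine/openrefine-stable
```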

A brief timeline of my journey is described in these blogposts:

http://blog.ouseful.info/2013/12/02/packaging-software-for-distance-learners-vms-101/
http://blog.ouseful.info/2014/05/15/confused-again-about-vm-ecology-i-blame-not-blogging/
http://blog.ouseful.info/2014/12/10/thoroughly-confused-about-student-vms-docker/
http://blog.ouseful.info/2015/01/14/using-docker-to-build-course-vms/

Ravenwater commented Jan 29, 2015

@tony Excellent, thank you for that background: it makes perfect sense. I've been through the same 'how to package a collection' experiment myself; for us it also included the hardware and software appliance module.

The problem we encountered is that some academic environments are very Microsoft oriented, some are very Apple oriented, and most of the good server-side web application stuff is Canonical oriented. Then throw in a smorgasbord of Java and JVM languages, JS environments for web development, and problem-solving environments such as MATLAB, COMSOL, Gaussian, SciLab, Octave, and R, and you end up without a good answer.

We have been putting some elbow grease into containerizing the server side (Hadoop, Cassandra, etc.), guided by the observation that a modern SOA is going to be a micro-services platform, and containers are the right weight for delivering such microservice SOAs. However, that doesn't solve the MATLAB/COMSOL/Gaussian commercial software problem; these are real workhorses in academia, where the cost to use them is minimal.

I still like the Vagrant path, as the tools are more mature and work in Microsoft environments.

Love to collaborate, as we are doing similar things.


psychemedia (Author) commented Jan 29, 2015

@Ravenwater Agreed on the commercial software side, though I can imagine (not sure how it would be implemented) something like a boot2dockerPro that provides metered access to a commercial software container, logs the use of said containers, and provides billing at an institutional level, say. E.g. I could imagine an institutional dockerhub that includes commercial as well as free containers, with the institution paying license fees on commercial containers on a per-get basis, or with metering via docker runners that just log use of the commercial containers. (Though possibly intrusive, I can also imagine corporate types finding it desirable to be able to track usage of all containers run using docker on institutional machines.)

A real issue we have at the UK Open University is the current IT policy that virtual machines are not allowed on trusted/managed computers because they might introduce vulnerabilities (in part because the VM would have access to the core IT network, and arbitrary code could be installed and run in the VM that could then act maliciously within the core network context).

As well as the educational context, I have an interest in working with data in general and open data sets in particular, for example in a news context. I first took an interest in VMs after seeing @DataMinerUK's Infinite Interns project, which supports the creation of a variety of VM configurations for use in a data journalism context. In the last few days, I've started pondering how we might create custom configurations based around orchestrated assemblies of containers, e.g. Frictionless Data Analysis – Trying to Clarify My Thoughts.

(You might note that a lot of my interests relate to client-side use of containers, which I think is underexplored at the moment?)

Would be great to be able to both work up and generalise some of these ideas further, and also explore the use cases you're looking at.

psychemedia (Author) commented Feb 1, 2015

To complement the OpenRefine container, I've started exploring a container for running CSV-backed reconciliation services using the Open Knowledge Lab Reconcile-CSV server: Reconcile-CSV container.

Whilst my early tests work fine for calling the reconciliation service from an OpenRefine container via the container's IP address on the host, I can't seem to register the reconciliation service published by a `--link`ed Reconcile-CSV server from the OpenRefine container using the aliases or IP addresses that exist within docker?
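For reference, the sort of thing being attempted looks roughly like this (container and image names are hypothetical, and the Reconcile-CSV port depends on how the server is started):

```shell
# Sketch only: container/image names and the service port are hypothetical.

# Start the Reconcile-CSV service container first:
docker run -d --name reconcile-csv psychemedia/reconcile-csv

# Start OpenRefine linked to it; the alias "reconcile" becomes a
# hostname inside the OpenRefine container:
docker run -d -p 3333:3333 --link reconcile-csv:reconcile openrefine/openrefine

# Registering http://reconcile:8000/reconcile as a reconciliation service
# then fails. One possible (unverified) explanation: the service metadata
# is fetched by the browser on the host, which cannot resolve
# docker-internal aliases, even though the OpenRefine backend could.
```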

Ravenwater commented Feb 2, 2015

@psychemedia Third-party virtualized workloads would have to go through a hardening and security review before they could be introduced and deployed in trusted environments: that, unfortunately, is just a fact of life, for exactly the reasons you describe: arbitrary code could be brought in, breaching any trust. Interestingly enough, for the DoS security vector, VMs are the better choice, as the hypervisor can exert some control over the VM's use of resources; container technologies are not as mature yet and really can take down a machine for other tenants. That implies you should apply containers where there is a need for efficiency, and VMs where there is a need for control.

If you have a dockerfile that you can share, I can take a look and debug.

Rots (Contributor) commented Jan 27, 2019

I'll add my use case (which I suspect is much more common than the discussions above).

As a developer I would like a fast way of testing the (latest) OpenRefine without much hassle. Having an official Docker image would let me easily deploy and test-run the software with tools that are already familiar to me (Docker), without having to trust any other third parties (e.g. there are tens of docker images available on Docker Hub that contain OpenRefine).
Even though OpenRefine "is already a self-contained widget that minimally inserts itself into a machine", I feel much more comfortable pulling and running a container (without any bind-mounts) than running a script from the internet which has access to my file system and downloads some extra stuff (Java, Maven) to make the software work. A Docker image is a standard way of delivering software (and it mostly works fine even without internet access once you have it).
It also helps with cleanup: I don't have to worry about conflicting versions of dependencies lying around (yes, even Maven has many versions) or having to clean them up manually afterwards.

I think the effort is worthwhile for the project to gain some more traction/popularity among "average" tinkerers.

wetneb (Member) commented Jan 27, 2019

@Rots do the Dockerfiles proposed above work for you? If so and if you want them to be part of the official repo, it might just be a matter of submitting these as a PR.

thadguidry (Member) commented Jan 27, 2019

(Dockerfile links in initial comment were broken, since Docker Hub made some slight changes....but updated them with good links now)

thadguidry (Member) commented Jan 27, 2019

Mine as example:

FROM maven:3.6.0-jdk-8-alpine
MAINTAINER thadguidry@gmail.com
RUN apk add --no-cache git
RUN git clone https://github.com/OpenRefine/OpenRefine.git 
RUN OpenRefine/refine build
RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD ["OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
Rots (Contributor) commented Jan 27, 2019

add:
VOLUME /mnt/refine

psychemedia (Author) commented Feb 6, 2019

Here's another approach that uses a release:

FROM maven:3.6.0-jdk-8-alpine
MAINTAINER tony.hirst@gmail.com

ARG RELEASE=3.1
ENV RELEASE=$RELEASE

RUN wget --no-check-certificate https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz
RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD openrefine-$RELEASE/refine -i 0.0.0.0 -d /mnt/refine

#Reference:
##to peek inside the container:
# docker run -i -t psychemedia/openrefinetest /bin/bash

##to run:
# docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

You can build to a specific version with eg:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .

psychemedia (Author) commented Feb 7, 2019

@thadguidry Is there a clean-up step that can be added to your dockerfile to tidy up the container and make it essentially a container for the clean/distributable app?

Also, does OpenRefine require Java JDK to run, or can it get by with JRE?

thadguidry (Member) commented Feb 7, 2019

@psychemedia Yes, extra RUN commands could be added to remove build folders/files and leave only the classes/libs. @wetneb probably has the info that could be added. OpenRefine needs the JDK when it is compiled (the refine build command), but once compiled it only needs a JRE to run. "Standard Java Web App Stuff". My Dockerfile is for active development usage.
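One way to get both would be a multi-stage build that compiles with the JDK image and copies the result onto a JRE base. A sketch only: the runtime image tag and paths are assumptions, and the refine launcher may have further runtime dependencies (e.g. bash) not present on a minimal JRE image, so this is a starting point rather than a tested build:

```dockerfile
# Sketch only: image tags and paths are assumptions; the refine launcher
# may need extras (e.g. bash) not on a minimal JRE base image.

# Build stage: JDK + Maven + git
FROM maven:3.6.0-jdk-8-alpine AS builder
RUN apk add --no-cache git
RUN git clone https://github.com/OpenRefine/OpenRefine.git
RUN OpenRefine/refine build

# Runtime stage: JRE only, with just the built tree copied across
FROM openjdk:8-jre-alpine
COPY --from=builder /OpenRefine /OpenRefine
RUN mkdir /mnt/refine
VOLUME /mnt/refine
EXPOSE 3333
CMD ["/OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]
```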
