Starter environment for Artificial Intelligence and Big Data using docker. Includes: Kafka for Pub/Sub, Spark for computation, Cassandra for storing, ElasticSearch for indexing, Kibana for visualization, Anaconda for data-science with python, Jupyter Notebook for coding, Selenium for scraping.

Jonarod/ai_bigdata_starter

Docker for Artificial Intelligence and Big Data

docker run -it --rm --privileged -v `pwd`:/home/guest/host -p 23:22 -p 4040:4040 -p 5601:5601 -p 8888:8888 -p 9200:9200 -p 9300:9300 jonarod/ai_bigdata_starter

Comes with:

  • Spark for parallelized computation
  • Kafka for Publish / Subscribe messaging
  • Anaconda for Data Science using Python (pre-installs scikit-learn, pandas, numpy...)
  • Cassandra for clustered storage
  • ElasticSearch for indexing and fast front-end API generation
  • Kibana for exploring and visualizing ElasticSearch data
  • Jupyter Notebook as a clean coding IDE using IPython
  • Selenium + Firefox + Geckodriver for accurate web scraping (executes JavaScript using a headless browser)

Get started

First of all, start the container:

docker run -it --rm --privileged -v `pwd`:/home/guest/host -p 23:22 -p 4040:4040 -p 5601:5601 -p 8888:8888 -p 9200:9200 -p 9300:9300 jonarod/ai_bigdata_starter

The command opens an interactive bash session (pseudo-TTY) inside the container. From there, you can start all services at once with:

startup_script.sh

Wait about 1 minute for all services to start.
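Rather than waiting a fixed time, you can poll the published ports until they accept connections. The following is a minimal sketch, not part of the image: wait_for_port is a hypothetical helper name, and it assumes bash (it relies on bash's /dev/tcp redirection) and that the services listen on their usual default ports (9200 for ElasticSearch, 5601 for Kibana, 8888 for Jupyter).

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a TCP port until it accepts connections,
# or give up once TIMEOUT seconds (default 90) have elapsed.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-90} waited=0
  # The subshell opens (and immediately closes) a TCP connection via /dev/tcp.
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout: ${host}:${port}" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "ready: ${host}:${port}"
}

# Example use from the host (assumes the default service ports):
#   wait_for_port localhost 9200 && wait_for_port localhost 5601
```

An open port only shows that the process is listening; a service may still need a few more seconds before it answers requests.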

Alternatively, you can start services one by one:

Start ssh:

service sshd start

Start cassandra:

service cassandra start

Start ElasticSearch:

service elasticsearch start

Start Kibana (start ElasticSearch first, because Kibana depends on it):

service kibana start

Start Kafka (this takes two steps, because the broker needs a running ZooKeeper instance):

# start zookeeper first
$HOME/kafka/bin/zookeeper-server-start.sh $HOME/kafka/config/zookeeper.properties  > /home/guest/zookeeper.log 2>&1 &

# then start a broker
$HOME/kafka/bin/kafka-server-start.sh $HOME/kafka/config/server.properties > /home/guest/kafka.log 2>&1 &
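Both commands run in the background, so the broker may not be ready as soon as the prompt returns. One way to check is to watch the log files written above. This is a rough sketch: wait_for_log is a hypothetical helper, and the log patterns in the usage comments are assumptions about the wording ZooKeeper and Kafka typically print, which can differ between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until FILE contains a line matching PATTERN
# (a grep basic regular expression), or return 1 once TIMEOUT seconds
# (default 60) have elapsed.
wait_for_log() {
  local file=$1 pattern=$2 timeout=${3:-60} waited=0
  until grep -q -- "$pattern" "$file" 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Example use inside the container, with the log paths from the commands above
# (the patterns are assumptions about the log wording):
#   wait_for_log /home/guest/zookeeper.log "binding to port" 30
#   wait_for_log /home/guest/kafka.log "started (kafka.server.KafkaServer)" 60
```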

Start jupyter notebook:

jupyter notebook --allow-root

Pro tips: using plain docker commands, you can:

Start one service using one line:

docker run -it -p 8888:8888 -v `pwd`:/home/guest/host jonarod/ai_bigdata_starter jupyter notebook --ip=0.0.0.0 --allow-root

Enter a running container:

# first check running docker instances:
docker ps

CONTAINER ID        IMAGE                        COMMAND      CREATED             STATUS              PORTS                    NAMES
490ac3c59958        jonarod/ai_bigdata_starter   "..."        18 seconds ago      Up 17 seconds       0.0.0.0:8888->8888/tcp   gifted_galileo

# Then run
docker exec -it <RUNNING CONTAINER ID HERE> bash
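To avoid copying the ID by hand, the lookup can be scripted with docker's ancestor filter, which selects containers created from a given image. This is only a sketch: container_id is a hypothetical helper name, and it assumes the most recently started container from this image is the one you want.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ID of the newest running container
# created from IMAGE (defaults to this repository's image).
container_id() {
  local image=${1:-jonarod/ai_bigdata_starter}
  # -q prints IDs only; docker ps lists the newest container first.
  docker ps -q --filter "ancestor=${image}" | head -n 1
}

# Example use:
#   docker exec -it "$(container_id)" bash
```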

Install other packages:

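docker exec <RUNNING CONTAINER ID HERE> pip install xgboost
# -y answers yum's confirmation prompt, which cannot be answered without an interactive terminal
docker exec <RUNNING CONTAINER ID HERE> yum install -y unzip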
