# Who is this notebook for?
This notebook is for those who want to try out the features described in the project without the hassle of installing and setting up Java, Apache Hadoop, and Apache Spark.\
If you have already installed and set up Hadoop and Spark on your machine, or plan to, you don't need the Docker application. In that case, you can disregard this notebook.

# Running The Docker Application

## Installing Docker

With that in mind, the first thing you need to do is install [Docker](https://www.docker.com/) if you haven't already. On Windows, Mac, and Linux, you can install Docker through [Docker Desktop](https://docs.docker.com/get-started/get-docker/). On Linux, you can also install the [Docker Engine](https://docs.docker.com/engine/install/) without having to install the Docker Desktop UI.\
***Note that you may need to restart your machine after installation***

After installing and setting up Docker, you can test it out by deploying some prebuilt images. If you are completely new to Docker, you can work through the [Introduction course](https://docs.docker.com/get-started/introduction/) to try Docker out. For the project, we will be working with the following Docker concepts, among others:
- [Images](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-an-image/) and [Containers](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-a-container/).
- [Publishing ports](https://docs.docker.com/get-started/docker-concepts/running-containers/publishing-ports/) of containers.
- Sharing files between host and containers by [bind mounting](https://docs.docker.com/get-started/docker-concepts/running-containers/sharing-local-files/).
- Running multi-container applications with a [Docker Compose](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-docker-compose/) file.

## Deploying the application on your host

When everything is set, you can finally get started by deploying the application on your host machine.\
In your CLI of choice, navigate to the [`docker-hadoop/`](./docker-hadoop/) directory and run:
```bash
docker-compose up -d
```
You should see the following output:

![compose_output](./resource/demo/compose_output.png)

If this is your first time building the services, you will need to wait for Docker to pull the images to your machine, which may take up to 5 minutes depending on your connection speed.
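
Newer Docker installations ship Compose as a plugin that is invoked as `docker compose` (with a space), while the hyphenated `docker-compose` is the older standalone binary. If you are unsure which one your machine has, a small wrapper can pick whichever is available. This is only a sketch; the function name `compose` is a hypothetical helper and not part of the project:

```bash
# Hypothetical helper: forward all arguments to whichever Compose
# command is installed on this machine.
compose() {
  if docker compose version >/dev/null 2>&1; then
    # Compose V2 plugin is available.
    docker compose "$@"
  else
    # Fall back to the standalone binary.
    docker-compose "$@"
  fi
}
```

With this defined in your shell session, `compose up -d` behaves like the command above on either kind of installation.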

When everything finishes, you can check the status of running containers by executing `docker-compose ps` or `docker ps`. If you have installed Docker Desktop, you can also see the status reported in the Containers tab.

![status_cli](./resource/demo/status_cli.png)
![status_ui](./resource/demo/status_ui.png)
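
If you start the application from a script, the status check can be automated instead of read by eye. The following is a minimal sketch, assuming the container you care about is named `namenode` (the name used with `docker exec` in this notebook); the helper name `wait_for_container` and the retry count are hypothetical choices:

```bash
# Hypothetical helper: poll `docker inspect` until the container reports
# that it is running, or give up after a number of attempts.
wait_for_container() {
  name="$1"
  tries="${2:-30}"   # number of attempts; 30 is an arbitrary default
  i=0
  while [ "$i" -lt "$tries" ]; do
    state="$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)"
    if [ "$state" = "true" ]; then
      echo "$name is running"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "$name did not start in time" >&2
  return 1
}
```

For example, `wait_for_container namenode` returns once the container is up. Note that a running container does not guarantee that the Hadoop services inside it have finished starting.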

---
---

*Side note*: When you are done with the current session and want to terminate the application, run the following command:
```bash
docker-compose stop
```
This will stop the running containers, and you can start them again with `docker-compose start`.

However, if you want to completely remove the containers and networks created by the `docker-compose.yml` file, run:
```bash
docker-compose down
```
To also remove the associated volumes, add the `-v` option: `docker-compose down -v`.

For more information about the `docker compose` command, see [here](https://docs.docker.com/reference/cli/docker/compose/).

# Using the Docker Application
After everything is successfully deployed, you can use the application as is, or modify the `docker-compose.yml` file to better suit your needs.

The following sections demonstrate some of the functionality available.

# Accessing the HDFS

## Within the bash shell

One way to interact with the HDFS hosted by the application is through the bash shell of the namenode container. To enter the container's bash shell, run:
```bash
docker exec -it namenode bash
```
Note that the namenode container must be running in order to do this.

![namenode_bash](./resource/demo/namenode_bash.png)

Within this container, you can access the HDFS by using the `hdfs dfs` command. Run `hdfs dfs -help` to see the available commands and their options.
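
As a quick orientation, the sketch below chains a few common `hdfs dfs` subcommands. It is meant to be pasted into the namenode container's shell. The `/demo` directory, the `hello.txt` file, and the function name `hdfs_smoke_test` are all hypothetical names chosen for illustration:

```bash
# Hypothetical helper: create a directory in HDFS, upload a small file,
# then list and read it back.
hdfs_smoke_test() {
  # Create the directory (no error if it already exists).
  hdfs dfs -mkdir -p /demo || return 1
  # Write a small file on the container's local file system.
  echo "hello hdfs" > /tmp/hello.txt
  # Upload it, overwriting any earlier copy.
  hdfs dfs -put -f /tmp/hello.txt /demo/ || return 1
  # List the directory, then print the file contents.
  hdfs dfs -ls /demo
  hdfs dfs -cat /demo/hello.txt
}
```

After defining the function, run `hdfs_smoke_test` to execute the steps in order.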

## Within the WebHDFS

By default, the Docker application is set up with WebHDFS, and the UI port 9870 is published to the same port on the host so that it can be reached from outside the container. To access the UI, just visit http://localhost:9870.

Within the web UI, you can view the HDFS by selecting the `Browse the file system` option inside the `Utilities` tab. Here you can interactively create or delete directories and files.
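
The same port also serves the WebHDFS REST API under `/webhdfs/v1`, so read-only checks can be scripted from the host. A minimal sketch, assuming `curl` is installed and the default port mapping is unchanged; the helper name `webhdfs_ls` is hypothetical:

```bash
# Hypothetical helper: list an HDFS directory through the WebHDFS REST API.
# The path argument defaults to the HDFS root.
webhdfs_ls() {
  path="${1:-/}"
  # LISTSTATUS is answered by the namenode and returns a JSON document.
  curl -sS "http://localhost:9870/webhdfs/v1${path}?op=LISTSTATUS"
}
```

Calling `webhdfs_ls /` prints the JSON listing of the root directory.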
<div align='center'>
    <img src="./resource/demo/file_browse.png">
    <img src="./resource/demo/webhdfs.png" width="660">
</div>

However, you will run into an error when uploading files from your local file system. This is because WebHDFS tries to transfer the data through http://datanode1:9000, an address that does not exist outside of the Docker application environment. Thus, if you have automation tasks that need to save directly to the HDFS, you must run them from within the application.

A demonstration of how to do this is provided in the next section.
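
Independently of the Python route, a single file can also be moved into the HDFS from the host shell by staging it inside the namenode container first. This is a sketch under assumptions: it presumes the container is named `namenode` and that `hdfs` is on the `PATH` for `docker exec`, and the helper name `put_into_hdfs` is hypothetical:

```bash
# Hypothetical helper: copy a local file into the container, then upload
# it to an HDFS directory from inside the container.
# Usage: put_into_hdfs <local_file> <hdfs_dir> [container]
put_into_hdfs() {
  src="$1"
  dest="$2"
  container="${3:-namenode}"
  # Staging location on the container's local file system.
  staged="/tmp/$(basename "$src")"
  docker cp "$src" "$container:$staged" &&
    docker exec "$container" hdfs dfs -mkdir -p "$dest" &&
    docker exec "$container" hdfs dfs -put -f "$staged" "$dest"
}
```

For example, `put_into_hdfs ./data.csv /demo` would stage `data.csv` under `/tmp` in the container and then upload it to `/demo`. Because the upload runs inside the Docker network, the datanode address resolves normally.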

# Executing Python scripts within the application

#