# Who is this notebook for?
This notebook is meant for those who want to test out the functionalities mentioned in the project without the hassle of installing and setting up Java, Apache Hadoop and Apache Spark.\
If you have already installed and set up Hadoop and Spark on your machine, or plan to do so, you don't need the Docker application and can disregard this notebook.

**Table of contents**
- [**Running the Docker Application**](#running-the-docker-application)
- [**Using the Docker Application**](#using-the-docker-application)
    - [Accessing the HDFS](#accessing-the-hdfs)
    - [Executing Python scripts within the application](#executing-python-scripts-within-the-application)
    - [Accessing HDFS with Python](#accessing-hdfs-with-python)
    - [Accessing HDFS from PySpark Session](#accessing-hdfs-from-pyspark-session)

# **Running the Docker Application**

## Installing Docker

With that in mind, the first thing you need to do is install [Docker](https://www.docker.com/) if you haven't already. On Windows, Mac and Linux, you can install Docker through [Docker Desktop](https://docs.docker.com/get-started/get-docker/). On Linux, you can also install the [Docker Engine](https://docs.docker.com/engine/install/) without having to install the Docker Desktop UI.\
***Note that you may need to restart your machine after installation***

After installing and setting up Docker, you can test it out by deploying some prebuilt images. If you are completely new to Docker, you can work through the [Introduction course](https://docs.docker.com/get-started/introduction/) to try Docker out. For the project, we will be working with Docker concepts including, but not limited to:
- [Images](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-an-image/) and [Containers](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-a-container/).
- [Publishing ports](https://docs.docker.com/get-started/docker-concepts/running-containers/publishing-ports/) of containers.
- Sharing files between host and containers by [bind mounting](https://docs.docker.com/get-started/docker-concepts/running-containers/sharing-local-files/).
- Running multi-container applications with a [Docker Compose](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-docker-compose/) file.

## Deploying the application on your host

When everything is set, you can finally get started by deploying the application on your host machine.\
In your CLI of choice, navigate to the [`docker-hadoop/`](./docker-hadoop/) directory and run:
```bash
docker-compose up -d
```
You should see the following output:

![compose_output](./resource/demo/compose_output.png)

If this is your first time building the services, you will need to wait for Docker to pull the images to your machine, which may take up to 5 minutes depending on your connection speed.

When everything finishes, you can check the status of running containers by executing `docker-compose ps` or `docker ps`. If you have installed Docker Desktop, you can also see the status reported in the Containers tab.

![status_cli](./resource/demo/status_cli.png)
![status_ui](./resource/demo/status_ui.png)
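
If you prefer a scripted check from the host, the sketch below tests whether the published ports accept TCP connections. It only assumes the two ports this notebook uses (9870 for the HDFS web UI and 8888 for Jupyter); the helper names `port_is_open` and `report` are made up for illustration. An open port shows that something is listening, not that the service behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def report(ports: dict[str, int], host: str = "localhost") -> dict[str, bool]:
    """Map each service label to whether its published port accepts connections."""
    return {label: port_is_open(host, port) for label, port in ports.items()}


# Example call, using the ports mentioned in this notebook:
# print(report({"namenode UI": 9870, "jupyter": 8888}))
```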

---
---

*Side note*: When you are done with the current session and want to terminate the application, run the following command:
```bash
docker-compose stop
```
This will stop the running containers. You can start them again with `docker-compose start`.

However, if you want to completely remove the containers and networks created by the `docker-compose.yml` file, run:
```bash
docker-compose down
```
To also remove the associated volumes, specify the additional option `-v`.

For more information about the `docker compose` command, see [here](https://docs.docker.com/reference/cli/docker/compose/).

# **Using the Docker Application**
After everything is successfully deployed, you can use the application as is, or modify the `docker-compose.yml` file to better suit your needs.

The following sections demonstrate some of the functionalities available.

# **Accessing the HDFS**

## Within the bash shell

One way to interact with the HDFS hosted by the application is through the bash shell of the namenode container. To enter the container's bash shell, run:
```bash
docker exec -it namenode bash
```
Note that the namenode container must be running in order to do this.

![namenode_bash](./resource/demo/namenode_bash.png)

Within this container, you can access the HDFS by using the `hdfs dfs` command. Run `hdfs dfs -help` to see the available commands and their options.
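
For automation from the host, the same `hdfs dfs` commands can be passed to `docker exec` without opening an interactive shell. The following is a minimal sketch: it assumes the container is named `namenode` as above and that `hdfs` is on the `PATH` for non-interactive `docker exec` calls. The helper names are hypothetical.

```python
import subprocess


def hdfs_dfs_command(*args: str, container: str = "namenode") -> list[str]:
    """Build the `docker exec` argument list for an `hdfs dfs` call."""
    return ["docker", "exec", container, "hdfs", "dfs", *args]


def run_hdfs_dfs(*args: str, container: str = "namenode") -> str:
    """Run the command from the host and return its standard output.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        hdfs_dfs_command(*args, container=container),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example calls (the namenode container must be running):
# print(run_hdfs_dfs("-ls", "/"))
# run_hdfs_dfs("-mkdir", "-p", "/demo")
```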

## Within the WebHDFS

By default, the Docker application is set up with WebHDFS, and the UI port 9870 is published to the same port on the host so that it can be reached from outside the container. To access the UI, just visit http://localhost:9870.

Within the web UI, you can view the HDFS by going to the `Browse the file system` option inside the `Utilities` tab. Here you can interactively create or delete directories and files.
<div align='center'>
    <img src="./resource/demo/file_browse.png">
    <img src="./resource/demo/webhdfs.png" width="660">
</div>

However, you will run into an error when uploading files from your local file system. This is because WebHDFS tries to transfer the data through http://datanode:9864, an address that does not exist outside of the Docker application environment.
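
The split between what works and what fails from the host can be sketched with the WebHDFS REST interface. Metadata operations such as `LISTSTATUS` are answered by the namenode itself, while data operations such as `CREATE` and `OPEN` are redirected to a datanode through a `Location` header. The helper names below are hypothetical, and the redirect URL in the usage comment is an illustrative example rather than captured output.

```python
import json
from urllib.parse import quote, urlencode, urlsplit
from urllib.request import urlopen


def webhdfs_url(path: str, op: str, base: str = "http://localhost:9870") -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    return f"{base}/webhdfs/v1{quote(path)}?{urlencode({'op': op})}"


def list_directory(path: str = "/", base: str = "http://localhost:9870") -> list[str]:
    """List entry names under `path`; served by the namenode, so it works from the host."""
    with urlopen(webhdfs_url(path, "LISTSTATUS", base), timeout=5) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
    return [status["pathSuffix"] for status in statuses]


def redirect_host(location: str) -> str:
    """Extract the hostname from the Location header of a WebHDFS redirect."""
    return urlsplit(location).hostname or ""


# A redirect for an upload would point at the datanode's container hostname,
# which the host machine cannot resolve, for example:
# redirect_host("http://datanode:9864/webhdfs/v1/demo/demo.txt?op=CREATE")
```

Calling `list_directory()` from the host should succeed while the application is running, whereas following a redirect to `datanode` fails at name resolution.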

# **Executing Python scripts within the application**

As mentioned above, aside from the HDFS service, the application also provides a Jupyter service with PySpark pre-installed.\
This service uses Python 3.11.6 and allows you to run any Python script of your choice within the container, and even to connect to the Jupyter Server from outside of the Docker environment.

## Entering the container to execute Python scripts
To run Python scripts within the container, you must first enter its bash shell, as before:
```bash
docker exec -it docker-hadoop-spark-notebook-1 bash
```
Once inside, you will be logged in as the user '**jovyan**'. This is the default user of the image. Within the bash shell, you can invoke Python with `python`, for example to check the version of the installed Python executable.

![py_ver](./resource/demo/py_ver.png)

Likewise, to execute a Python file, run `python path/to/file.py`. To install a package with pip, run `python -m pip install library`.

## Using the Jupyter service

Because of the nature of the project, you might also want to run an interactive Python session with Jupyter notebooks.

You have two options for doing this:
- Using the Web UI
- Connecting to the server from your own IDE that supports Jupyter notebooks.

**For accessing the Web UI**: just visit http://localhost:8888. Once there, you can freely create notebooks, Python scripts and other files. You may also upload files from your local file system.
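
To confirm from the host that the server is reachable at that URL, you can query its REST API, which reports the server version at `/api`. This is a minimal sketch that assumes the passwordless setup used in this project. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def parse_server_version(body: bytes) -> str:
    """Extract the version string from the JSON body returned by GET /api."""
    return json.loads(body)["version"]


def jupyter_version(base: str = "http://localhost:8888") -> str:
    """Ask the running Jupyter Server for its version."""
    with urlopen(f"{base}/api", timeout=5) as resp:
        return parse_server_version(resp.read())


# Example call (the Jupyter container must be running):
# print(jupyter_version())
```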

**For connecting to the Jupyter Server**:
1. In your Jupyter notebook, when selecting a kernel, choose the option `Existing Jupyter Server...`. This may be nested under the `Select Another Kernel...` option.
2. You will then be prompted to enter the URL of the Jupyter Server. Use the same URL mentioned above.
3. After entering the URL, you will be asked whether you want to connect to an insecure server. Select `Yes`.
4. After that, you can change the display name of the server and then connect to the Python 3 kernel.

***A remark on the insecure connection***: The warning appears because we have set up the server to be **passwordless** and to require **no** token authentication. We are hosting the server locally and do not intend to share access, so authentication was not needed. If you intend to share access with someone, it is recommended to set up at least token authentication.

In [None]:
# If you have successfully connected to the server, you can test out the following lines
print('This is some text to print out')

In [None]:
# you can also send commands to the bash shell
!ls
# or maybe create some file
!echo '' > just_some_text.txt
!head just_some_text.txt

# **Accessing HDFS with Python**

As mentioned in [Accessing the HDFS](#within-the-webhdfs), you can't upload files from your local file system because you are essentially trying to do data transfer through the address http://datanode:9864 which is non-existent outside of the application. The same thing will happen if you try to access the HDFS through a client like the one provided by [hdfs](https://pypi.org/project/hdfs/). You can establish a connection to http://localhost:9870, but you will not be able to do any read/write operations to the files on HDFS.

For automation tasks that require saving directly to the HDFS, we must do this within the Docker application.

To see some demonstrations, run the following cells after connecting to the Jupyter Server:

In [1]:
from hdfs import Client
client = Client('http://namenode:9870') # the client needs to connect to the WebHDFS

In [None]:
client.list('/') # all paths begin with /

In [None]:
client.makedirs('/demo')
client.list('/')

In [4]:
!echo "let's make a text file and put some words into it" > demo.txt
client.upload('/demo/demo.txt', 'demo.txt')

'/demo/demo.txt'

In [5]:
print(client.list('/demo'))
with client.read('/demo/demo.txt') as file:
    print(file.read())

['demo.txt']
b"let's make a text file and put some words into it\n"
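
The read above returns `bytes` because no encoding was given. For automation tasks, the sketch below writes a string to HDFS directly, without a temporary local file, and reads it back as `str`. It relies on the `encoding`, `data` and `overwrite` parameters of the `hdfs` client's `write` and `read` methods. The helper names and the path in the usage comment are hypothetical.

```python
def write_text(client, hdfs_path: str, text: str) -> None:
    """Write a string straight to HDFS, replacing any existing file."""
    client.write(hdfs_path, data=text, encoding="utf-8", overwrite=True)


def read_text(client, hdfs_path: str) -> str:
    """Read a whole HDFS file back as str instead of bytes."""
    with client.read(hdfs_path, encoding="utf-8") as reader:
        return reader.read()


# Example calls, using the `client` created earlier in this notebook:
# write_text(client, "/demo/note.txt", "written without a local file\n")
# print(read_text(client, "/demo/note.txt"))
```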


# **Accessing HDFS from PySpark Session**

To read or write a file on the HDFS with PySpark, simply specify the path as `hdfs://namenode:9000/path/to/file`.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

In [9]:
the_prev_txt = sc.textFile('hdfs://namenode:9000/demo/demo.txt')
the_prev_txt.take(5)

["let's make a text file and put some words into it"]
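
Writing works the same way as reading. The sketch below saves a small DataFrame to HDFS as CSV and reads it back, assuming the notebook user has write permission on `/demo`. The helper names and the `/demo/numbers.csv` path are made up for illustration. Note that Spark writes a directory of part files at the given path rather than a single file.

```python
def hdfs_uri(path: str, host: str = "namenode", port: int = 9000) -> str:
    """Prefix an absolute HDFS path with the namenode address."""
    if not path.startswith("/"):
        raise ValueError("HDFS paths must be absolute")
    return f"hdfs://{host}:{port}{path}"


def round_trip(spark, path: str = "/demo/numbers.csv"):
    """Write a tiny DataFrame to HDFS as CSV, then read it back."""
    df = spark.createDataFrame([(1, "one"), (2, "two")], ["id", "label"])
    # mode("overwrite") replaces the output directory if it already exists.
    df.write.mode("overwrite").csv(hdfs_uri(path), header=True)
    return spark.read.csv(hdfs_uri(path), header=True, inferSchema=True)


# Example call, using the `spark` session created earlier in this notebook:
# round_trip(spark).show()
```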