# Set Up Cluster Nodes with Dockerized Hadoop


In this lab, you will launch a single-node Hadoop cluster using Docker and run MapReduce jobs.

Skills Network Labs (SN Labs) is a virtual lab environment used in this course.  Upon clicking the "Open Tool" button below, your Username and Email will be passed to SN Labs and will be used in strict accordance with IBM Skills Network Privacy policy, such as for communicating important information to enhance your learning experience. 

## What is a Hadoop Cluster?

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. The NameNode is the master node of the Hadoop Distributed File System (HDFS). It keeps the file metadata in RAM for quick access. A production Hadoop cluster requires extensive resources, which are beyond the scope of this lab. In this lab, you will use dockerized Hadoop to create a Hadoop cluster that has:

1- NameNode

2- DataNode

3- NodeManager

4- ResourceManager

5- Hadoop history server

## Objectives

Run a dockerized Hadoop cluster instance

Create a file in HDFS and view it on the GUI

## Set Up Cluster Nodes with Dockerized Hadoop

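1- Start a new terminal.

2- Clone the lab repository:

    git clone https://github.com/ibm-developer-skills-network/ooxwv-docker_hadoop.git

3- Change into the cloned directory:

    cd ooxwv-docker_hadoop

4- Start the cluster containers in the background:

    docker-compose up -d

5- Open a bash shell inside the namenode container:

    docker exec -it namenode /bin/bash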
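
Before working inside the container, it can help to confirm that every service came up. The sketch below is meant to be run from a terminal on the host (not inside the container shell). It compares the names of the running containers against the five components listed earlier. Only `namenode` is confirmed by this lab, since it appears in the `docker exec` command; the other four container names are assumptions and should be adjusted to match the repository's `docker-compose.yml`. `missing_containers` and `check_cluster` are hypothetical helper names.

```shell
#!/bin/sh
# Expected container names. Only "namenode" appears in this lab;
# the remaining names are ASSUMPTIONS, check docker-compose.yml.
EXPECTED="namenode datanode nodemanager resourcemanager historyserver"

# missing_containers RUNNING EXPECTED
# Prints each name from EXPECTED that is absent from RUNNING
# (both are whitespace-separated lists), one per line.
missing_containers() {
  for name in $2; do
    # $(echo $1) collapses newlines in RUNNING into single spaces.
    case " $(echo $1) " in
      *" $name "*) ;;            # found, nothing to report
      *) echo "$name" ;;
    esac
  done
}

# check_cluster
# Asks Docker for the running container names and reports any gaps.
check_cluster() {
  running=$(docker ps --format '{{.Names}}') || return 1
  missing=$(missing_containers "$running" "$EXPECTED")
  if [ -n "$missing" ]; then
    echo "Not running:" $missing
    return 1
  fi
  echo "All expected containers are running."
}
```

Calling `check_cluster` after `docker-compose up -d` returns a non-zero status if any expected container is missing. Containers can take a few seconds to start, so an early failure may only mean the check ran too soon.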

## Explore the hadoop environment

As you have learnt in the videos and reading thus far in the course, a Hadoop environment is configured by editing a set of configuration files:

  * hadoop-env.sh: Serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

  * core-site.xml: Defines HDFS and Hadoop core properties.

  * hdfs-site.xml: Governs the location for storing node metadata, the fsimage file, and the log file.

  * mapred-site.xml: Lists the parameters for MapReduce configuration.

  * yarn-site.xml: Defines settings relevant to YARN. It contains configurations for the NodeManager, ResourceManager, containers, and Application Master.

For the Docker image, these XML files have already been configured. You can see them in the directory /opt/hadoop-3.2.1/etc/hadoop/ by running:

    ls /opt/hadoop-3.2.1/etc/hadoop/*.xml
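
To see what one of these files actually sets, a property can be pulled out by name. The sketch below defines a hypothetical helper, `get_prop`, that prints the `<value>` of a named `<property>` from a Hadoop XML configuration file. It assumes the usual layout in which `<value>` directly follows `<name>`, and a `grep` that supports `-o`. `fs.defaultFS` is a standard Hadoop core property and is used here only as an example; whether it is set in this image's `core-site.xml` is an assumption.

```shell
#!/bin/sh
# get_prop FILE NAME
# Prints the value of property NAME from a Hadoop XML config FILE.
# Newlines are removed first so that single-line and multi-line
# <property> entries are handled the same way.
get_prop() {
  tr -d '\n' < "$1" \
    | grep -o "<name>$2</name>[[:space:]]*<value>[^<]*</value>" \
    | sed 's/.*<value>\([^<]*\)<\/value>/\1/'
}
```

Inside the namenode container, `get_prop /opt/hadoop-3.2.1/etc/hadoop/core-site.xml fs.defaultFS` would print the default file system URI if the property is defined in that file. Hadoop's own `hdfs getconf -confKey fs.defaultFS` reports the effective value after all configuration sources are merged.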

## Create a file in the HDFS

  * In HDFS, create a directory structure named /user/root/input.

  * Copy all the Hadoop configuration XML files into the input directory.

  * Create a data.txt file in the current directory.

  * Copy the data.txt file into /user/root.

  * Check that the file has been copied into HDFS by viewing its content.

Run the following commands in order, one for each step above:

    hdfs dfs -mkdir -p /user/root/input

    hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/root/input

    curl https://raw.githubusercontent.com/ibm-developer-skills-network/ooxwv-docker_hadoop/master/SampleMapReduce.txt --output data.txt

    hdfs dfs -put data.txt /user/root/

    hdfs dfs -cat /user/root/data.txt
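
The final check prints the whole file, which is fine for a small sample. As a hedged alternative, the sketch below verifies the upload without reading through its contents: `hdfs dfs -test -e` exits with status 0 when a path exists, and the byte count from `hdfs dfs -cat` can be compared with the local file. `verify_upload` is a hypothetical helper name, not part of Hadoop.

```shell
#!/bin/sh
# verify_upload LOCAL_FILE HDFS_PATH
# Succeeds only if HDFS_PATH exists and has the same byte count
# as LOCAL_FILE. Meant to be run inside the namenode container.
verify_upload() {
  local_file=$1
  hdfs_path=$2

  # -test -e returns exit status 0 when the path exists in HDFS.
  if ! hdfs dfs -test -e "$hdfs_path"; then
    echo "missing in HDFS: $hdfs_path"
    return 1
  fi

  # tr strips the padding that some wc implementations add.
  local_bytes=$(wc -c < "$local_file" | tr -d ' ')
  hdfs_bytes=$(hdfs dfs -cat "$hdfs_path" | wc -c | tr -d ' ')

  if [ "$local_bytes" -ne "$hdfs_bytes" ]; then
    echo "size mismatch: local=$local_bytes hdfs=$hdfs_bytes"
    return 1
  fi
  echo "ok: $hdfs_path ($hdfs_bytes bytes)"
}
```

For this lab the call would be `verify_upload data.txt /user/root/data.txt`. Comparing byte counts is a light sanity check rather than a full integrity check.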

* Deployed Hadoop using Docker

* Created data in HDFS and viewed it on the GUI



Author(s)

Lavanya T S

| Date (DD-MM-YYYY) | Version | Changed By     | Change Description                                               |
| ----------------- | ------- | -------------- | ---------------------------------------------------------------- |
| 18-01-2022        | 1.0     | Lavanya        | Created lab instructions for Hadoop Cluster                      |
| 01-09-2022        | 1.1     | K Sundararajan | Updated instructions for Launch Application as per new Theia IDE |
| 13-02-2023        | 1.2     | K Sundararajan | Updated screenshots                                              |