## Hadoop Installation

> It's highly recommended to use a Linux system, as Hadoop is known to cause issues when used with Windows.

Now that we have a high-level understanding of what Hadoop is, let's download and install it to get a better feel of how it operates. 

At a high level, getting Hadoop working on your system involves the following steps:

- Downloading the appropriate Hadoop package (the version to select depends on several criteria, such as compatibility with other big data tools)
- Extracting (untar) the files and directories into an appropriate folder on your system
- Configuring the following files:

       .bashrc
       hadoop-env.sh
       core-site.xml
       hdfs-site.xml
       mapred-site.xml
       yarn-site.xml

_Note: We'll be using Linux (Ubuntu) for this tutorial, so if you're on a different operating system the steps might differ._

Let's begin:

#### 1. Open your terminal and run the following commands to __update__ all existing applications on your system:

In [None]:
sudo apt update
sudo apt -y upgrade

Once the update completes successfully, let's go ahead and reboot our system to ensure all the updates are applied:

In [None]:
# Reboot the system
sudo reboot

#### 2. Ensure that __Java__ is installed on your system. 

_Note: It is recommended to use `Java 8`, as this version is widely used and is compatible with Hadoop and other tools that integrate with Hadoop._

To do this, run the following command:

In [None]:
# Check the Java version
java -version

# Check the Java compiler version
javac -version

If Java is installed correctly, you'll see an output showing the Java and compiler version.  
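Since Java 8 reports itself as `1.8.x` while later releases report `11.x`, `17.x` and so on, it can be easy to misread the version string. The sketch below is one way to reduce it to a major version number. The `java_major` helper is a hypothetical name introduced for illustration only; it is not part of Java or Hadoop.

```shell
# Hypothetical helper: reduce a Java version string to its major version.
# Handles the legacy "1.8.0_201" scheme as well as "11.0.13" or "17".
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_201 -> 8.0_201
  esac
  # Drop everything from the first ".", "_" or "-" onwards
  printf '%s\n' "${v%%[._-]*}"
}

if command -v java >/dev/null 2>&1; then
  # "java -version" writes to stderr; the version is the first quoted string
  ver=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
  echo "Detected Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH"
fi
```

If the reported major version is not 8, the `update-alternatives` step described later in this section can be used to switch.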

Otherwise, you will need to download and install Java by running the following commands (sometimes both the JRE and JDK are required, so let's go ahead and install both just in case):

In [None]:
# Add the OpenJDK PPA and refresh the package index
# (only needed if openjdk-8 is not in your default repositories)
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update

# Install Java 8 runtime environment
sudo apt-get install openjdk-8-jre

# Install Java 8 development kit
sudo apt-get install openjdk-8-jdk


Check that both the JDK and the JRE have installed correctly.  If all is well, run the below command and you should see the Java version:

In [None]:
java -version

# Output should be similar to this
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

In [None]:
javac -version

# Output should be similar to this
javac 1.8.0_312

_Note: If you have multiple Java versions installed, it's possible to switch between them.  To do this, check what available Java versions currently exist by running the below command:_

In [None]:
# Check Java versions already installed
sudo update-alternatives --config java

You should see a menu similar to the one below.  You can select the desired version directly by typing in its number:

In [None]:
sudo update-alternatives --config java
[sudo] password for hadoop:

# Output should be similar to this
There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection	Path                                        	Priority   Status
------------------------------------------------------------
* 0        	/usr/lib/jvm/java-11-openjdk-amd64/bin/java  	1111  	auto mode
  1        	/usr/lib/jvm/java-11-openjdk-amd64/bin/java  	1111  	manual mode
  2        	/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081  	manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
$

Repeat the same step above for the Java compiler `javac`:

In [None]:
sudo update-alternatives --config javac

#### 3. Next, we need to ensure that the `JAVA_HOME` variable is correctly set up on your system.  To do that, run the below command:

In [None]:
echo $JAVA_HOME

If the variable is correctly set up, you should see a path show up similar to the one below:

In [None]:
echo $JAVA_HOME

# Expected output should be similar to:
/usr/lib/jvm/java-8-openjdk-amd64

If you get no output, or if the output is simply repeating JAVA_HOME, then the variable is not set up correctly.
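If you're not sure where Java lives, one way to find the path is to follow the `javac` symlink to the real binary and strip the trailing `/bin/javac`. This is only a sketch: `java_home_from` is a hypothetical helper name, and it assumes the usual JDK layout where the compiler sits in `<JAVA_HOME>/bin`.

```shell
# Hypothetical helper: strip the trailing /bin/<tool> from a resolved path.
java_home_from() {
  dirname "$(dirname "$1")"
}

if command -v javac >/dev/null 2>&1; then
  # readlink -f follows the /usr/bin/javac symlink chain to the real binary
  java_home_from "$(readlink -f "$(command -v javac)")"
else
  echo "javac not found on PATH"
fi
```

We use `javac` rather than `java` here because, on Java 8, the `java` binary typically resolves to a path under `jre/bin`, which would give the JRE folder instead of the JDK folder.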

To fix this, you will need to add the correct Java path to your `.bashrc` file.  To do this, first find the location of the Java installation, then open and update the `.bashrc` file to include the following lines at the end of the file (ensure you use the path to your Java folder):

In [None]:
# Open the .bashrc file using the Nano editor (you can use any other editor like Vim if you prefer)
nano ~/.bashrc

In [None]:
# Add the below to the .bashrc file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

After saving the file, we need to use the `source` command to apply the changes to the current shell session:

In [None]:
source ~/.bashrc

Now, to make sure everything is set up correctly, check that `JAVA_HOME` is working.  You should now be able to see the Java folder path correctly:

In [None]:
echo $JAVA_HOME

#### 4. Next, we should create a new user account for Hadoop.  HBase uses Hadoop's HDFS to store its data, so it's recommended to keep a separation of accounts between the Hadoop file system and the Linux file system to avoid confusion.  Go ahead and type the below commands:

In [None]:
# Create a new user called hadoop and add it to the sudo group
sudo adduser hadoop
sudo usermod -aG sudo hadoop

You will then be asked for some additional information including a password for the new user account.  Go ahead and add that information.  You should see output similar to:

In [None]:
# Expected output should be similar to the below
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
    Full Name []: 	 
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] y
adiwany@dodz-vm:~$ sudo usermod -aG sudo hadoop
adiwany@dodz-vm:~$
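
To double check that the account was created and added to the `sudo` group, something like the following can be used. `user_in_group` is a hypothetical helper written for this tutorial; it simply wraps the standard `id` command.

```shell
# Hypothetical helper: succeed if user $1 belongs to group $2.
user_in_group() {
  # id -nG prints the user's group names separated by spaces
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

if user_in_group hadoop sudo; then
  echo "hadoop is in the sudo group"
else
  echo "hadoop is missing or not in the sudo group"
fi
```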


#### 5. Now, we need to generate an __SSH key pair__ for the new Hadoop user. Run the below commands to switch to the newly created Hadoop user and to create the new SSH key:

In [None]:
# Switch to the Hadoop user
sudo su hadoop

# Generate the SSH key-pair
ssh-keygen -t rsa

If all goes well, you should see output similar to this:

In [None]:
ssh-keygen -t rsa

# Expected output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:tol1mX4v1GrDLeHHq1Wa/nvKaZUMlEGDCZisTAkWVPc hadoop@dodz-vm
The key's randomart image is:
+---[RSA 3072]----+
|  .=+.o.o.. ++o  |
|  .  o.+.  o o.  |
|	o .  E  .	|
| 	o 	o .   |
|    	S +  .o o|
|   	+ =  o .*.|
|  	. o .+.=+. |
|       	.O==..|
|       	..BB=+|
+----[SHA256]-----+
hadoop@dodz-vm:~$

#### 6. Next, add this newly generated key to the list of authorized SSH keys:

In [None]:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

If no errors come up, then everything is set up correctly so far.
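SSH can refuse key-based logins when the key files are too permissive, so it may be worth confirming the modes before moving on. The sketch below assumes the conventional `700` for `~/.ssh` and `600` for `authorized_keys`, and uses GNU `stat` as shipped with Ubuntu. `check_mode` is a hypothetical helper name.

```shell
# Hypothetical helper: compare a path's octal mode with the expected one.
# Uses GNU stat (-c), which is what Ubuntu ships.
check_mode() {
  actual=$(stat -c '%a' "$1" 2>/dev/null) || { echo "missing: $1"; return 1; }
  if [ "$actual" = "$2" ]; then
    echo "ok: $1 ($actual)"
  else
    echo "fix: $1 is $actual, expected $2"
    return 1
  fi
}

check_mode "$HOME/.ssh" 700 || true
check_mode "$HOME/.ssh/authorized_keys" 600 || true
```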

#### 7. Verify that we can use SSH with the newly generated key:

In [None]:
ssh localhost

If everything runs smoothly, you should see output similar to:

In [None]:
ssh localhost

# Expected output
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:fPYKPrq8VD1pNfI+7EXyKqQFFm4/eWi0+jjADURdHhU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? y
Please type 'yes', 'no' or the fingerprint: yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 20.04.3 LTS (GNU/Linux 5.11.0-41-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management: 	https://landscape.canonical.com
 * Support:    	https://ubuntu.com/advantage

23 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

Your Hardware Enablement Stack (HWE) is supported until April 2025.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

If you get an error saying something like __Connection Refused__, then the localhost is not properly set up or SSH is not yet installed.

To install SSH, enter the following command:

In [None]:
sudo apt install ssh 

You should see output similar to this:

In [None]:
sudo apt install ssh

# Expected output 
[sudo] password for hadoop:
Reading package lists... Done
Building dependency tree  	 
Reading state information... Done
The following packages were automatically installed and are no longer required:
  chromium-codecs-ffmpeg-extra gstreamer1.0-vaapi
  libgstreamer-plugins-bad1.0-0 libva-wayland2
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  ncurses-term openssh-server openssh-sftp-server ssh-import-id
Suggested packages:
  molly-guard monkeysphere ssh-askpass
The following NEW packages will be installed:
  ncurses-term openssh-server openssh-sftp-server ssh ssh-import-id
0 upgraded, 5 newly installed, 0 to remove and 18 not upgraded.
Need to get 693 kB of archives.
After this operation, 6,130 kB of additional disk space will be used.
Do you want to continue? [Y/n] y

#### 8. Download and install `Hadoop`

_Note:_ There are many Hadoop versions available to download and install. However, this tutorial uses version `2.10.1` as it's compatible with other tools that will be integrated with Hadoop in other notebooks. You can use any version you prefer but keep in mind that some bugs you may encounter could be different than what is covered in this notebook.

Save the version to a variable as follows:

In [None]:
Version="2.10.1"

Then download the corresponding Hadoop version as follows:

In [None]:
wget https://www-eu.apache.org/dist/hadoop/common/hadoop-$Version/hadoop-$Version.tar.gz

#### 9. Extract the files and move the resulting folder to the location of your choice (we'll be using `/usr/local/hadoop`). Create the new directory if required:

In [None]:
# Extract the Hadoop files
tar -xzvf hadoop-$Version.tar.gz

# Create the new folder
sudo mkdir /usr/local/hadoop

# Remove the downloaded tar file
rm hadoop-$Version.tar.gz

# Move everything to the new folder
sudo mv hadoop-$Version/ /usr/local/hadoop

#### 10. Now, we need to set `HADOOP_HOME` and add the directory containing the Hadoop binaries to your `.bashrc` file.  To do this, run the following command:

In [None]:
nano ~/.bashrc

Define the Hadoop environment variables required to configure the tool on your system.  To do this, we need to add the following content to the end of the `.bashrc` file (remember to use your own Java and Hadoop folder paths):

In [None]:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.10.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save the file and exit.  It's vital to run the below command to apply the changes we just added:

In [None]:
source ~/.bashrc
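
Before going further, it can help to confirm that `HADOOP_HOME` really points at the extracted Hadoop folder. The sketch below checks for a few entries we rely on later in this tutorial (`bin/hadoop`, `sbin/start-dfs.sh` and `etc/hadoop`). `check_hadoop_home` is a hypothetical helper, and the fallback path is the example location used in this tutorial.

```shell
# Hypothetical helper: report whether a directory looks like a Hadoop install.
check_hadoop_home() {
  for rel in bin/hadoop sbin/start-dfs.sh etc/hadoop; do
    if [ ! -e "$1/$rel" ]; then
      echo "missing: $1/$rel"
      return 1
    fi
  done
  echo "ok: $1"
}

# Falls back to this tutorial's example path if HADOOP_HOME is not set
check_hadoop_home "${HADOOP_HOME:-/usr/local/hadoop/hadoop-2.10.1}" || true
```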

#### 11. Next, we'll update the `hadoop-env.sh` file.  This file contains important configurations related to Hadoop's setup:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the `$JAVA_HOME` variable (i.e., remove the # sign) and _add the full path to the OpenJDK installation on your system without the bin directory_. If you don't know the Java path, run the following command to find out:

In [None]:
readlink -f /usr/bin/javac

Your file should look something like this:
<p align="center">
  <img src="images/hadoop-env3.png" width=600>
</p>


#### 12. The `core-site.xml` file defines important HDFS and Hadoop cluster properties.  To set up Hadoop properly, we need to provide the URL of the NameNode (master node).  To do that, run the following command to open the file:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

To specify that we'll be running Hadoop locally, add the following in between the `<configuration>` and `</configuration>` tags and save the file:

In [None]:
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
      <description>The default file system URI</description>
   </property>

#### 13. Next, we need to edit the `hdfs-site.xml` file.  This file stores the details regarding the location of the metadata, NameNode and DataNode directories. We'll first create those directories, then update the file to contain the new directory paths. Run the below commands:

In [None]:
sudo mkdir $HADOOP_HOME/dfsdata
sudo mkdir $HADOOP_HOME/dfsdata/namenode
sudo mkdir $HADOOP_HOME/dfsdata/datanode

Next, let's update the `hdfs-site.xml` file as follows:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following properties to the file in between the `<configuration>` and `</configuration>` tags. If needed, adjust the NameNode and DataNode directories to reflect your custom locations (if you used directories different than above).  We'll also set the default HDFS data replication factor to 1:

In [None]:
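<!-- Hadoop does not expand shell variables such as $HADOOP_HOME here; use absolute paths -->
<property>
  <name>dfs.name.dir</name>
  <value>/usr/local/hadoop/hadoop-2.10.1/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/usr/local/hadoop/hadoop-2.10.1/dfsdata/datanode</value>
</property>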
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
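
Keep in mind that Hadoop reads these XML values literally, so a shell variable such as `$HADOOP_HOME` is not expanded inside the file. One way around typing absolute paths by hand is to let the shell fill them in and print the result, as sketched below. `render_hdfs_site` is a hypothetical helper. It uses the property names `dfs.namenode.name.dir` and `dfs.datanode.data.dir`, which to my knowledge are the current names for the older `dfs.name.dir` and `dfs.data.dir`. Treat it as a starting point rather than a definitive configuration.

```shell
# Hypothetical helper: print an hdfs-site.xml body with absolute paths.
# $1 is the Hadoop installation directory.
render_hdfs_site() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>$1/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>$1/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

# Preview only; nothing is written to disk here
render_hdfs_site "${HADOOP_HOME:-/usr/local/hadoop/hadoop-2.10.1}"
```

Once the preview looks right, the output can be redirected to `$HADOOP_HOME/etc/hadoop/hdfs-site.xml`.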


#### 14. Edit the `mapred-site.xml` file to define the MapReduce required values:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following property to change the default MapReduce framework value to `yarn`:

In [None]:
<property> 
  <name>mapreduce.framework.name</name> 
  <value>yarn</value> 
</property> 

_Note: If you don't find the file, it could be because it hasn't been created yet. You'll find a template named `mapred-site.xml.template` in the same folder. Go ahead and open that file and save a copy as `mapred-site.xml`._
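Copying the template can also be done from the terminal. The sketch below only creates `mapred-site.xml` when it doesn't already exist, so it won't overwrite your edits. `ensure_mapred_site` is a hypothetical helper, and the fallback path is this tutorial's example location.

```shell
# Hypothetical helper: create mapred-site.xml from the template if it is absent.
# $1 is the Hadoop configuration directory.
ensure_mapred_site() {
  conf_dir="$1"
  if [ -f "$conf_dir/mapred-site.xml" ]; then
    echo "exists: $conf_dir/mapred-site.xml"
  elif [ -f "$conf_dir/mapred-site.xml.template" ]; then
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    echo "created: $conf_dir/mapred-site.xml"
  else
    echo "no template found in $conf_dir"
    return 1
  fi
}

ensure_mapred_site "${HADOOP_HOME:-/usr/local/hadoop/hadoop-2.10.1}/etc/hadoop" || true
```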

#### 15. Edit `yarn-site.xml` which is used to define YARN-specific settings by opening the file and adding the below:

In [None]:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following property between the `<configuration>` and `</configuration>` tags:

In [None]:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

This completes the required Hadoop configurations.

#### 16. Reboot your system to ensure all new settings are loaded before running Hadoop:

In [None]:
systemctl reboot -i

#### 17.  Validate Hadoop settings and configurations

After completing the above steps, we want to make sure that Hadoop was properly set up.  To do this, check the Hadoop version:

In [None]:
hadoop version

You should see output like this:

In [None]:
# Expected output
Hadoop 2.10.1
Subversion https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c562d78e816
Compiled by centos on 2020-09-14T13:17Z
Compiled with protoc 2.5.0
From source with checksum 3114edef868f1f3824e7d0f68be03650
This command was run using /usr/lib/hadoop/hadoop-2.10.1/share/hadoop/common/hadoop-common-2.10.1.jar

#### 18. Now we are ready to start the Hadoop cluster.  To do this, we need to run a number of commands:

- `start-dfs.sh` 
    -   This command starts HDFS
- `start-yarn.sh`
    -   This command starts YARN
- `jps`
    -   This command checks all Java processes to ensure the correct daemons are active

__Note__: It's recommended to run these commands from inside your `HADOOP_HOME` folder:

In [None]:
# Start HDFS 
bash start-dfs.sh

This will take some time to start HDFS.  You should see output similar to:
<p align="center">
  <img src="images/start-hdfs.png" width=600>
</p>


Then run:

In [None]:
# Start YARN
bash start-yarn.sh
jps

You should see the below output:

<p align="center">
  <img src="images/start-yarn-jps.png" width=600>
</p>
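
If you'd rather not compare the `jps` list by eye, the sketch below prints any daemon that is missing. It assumes a single-node setup where `NameNode`, `DataNode`, `SecondaryNameNode`, `ResourceManager` and `NodeManager` are all expected to run; `missing_daemons` is a hypothetical helper name.

```shell
# Hypothetical helper: read jps output on stdin and print each expected
# Hadoop daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>", so compare the second field exactly
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
else
  echo "jps not found on PATH"
fi
```

No output from `missing_daemons` means every expected daemon was found.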


_Note: If the `DataNode` process doesn't show up in the jps list, then we probably need to double check that the correct HDFS directory was created and that we have the proper permissions._

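To fix this, stop the Hadoop services and give the Hadoop user ownership of, and access to, the HDFS data folders. The commands should be similar to the below (make sure to use your HDFS path as the below is just an example):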

In [None]:
# Stop the Hadoop services
bash stop-dfs.sh
bash stop-yarn.sh

# Grant the Hadoop user read/write folder access
sudo chmod -R 755 $HADOOP_HOME/dfsdata/*
sudo chown -R hadoop:hadoop $HADOOP_HOME/dfsdata

After running the above commands, we need to start the `dfs` and `yarn` services again as follows:

In [None]:
# Restart the Hadoop services again
bash start-dfs.sh
bash start-yarn.sh

Now, if you run `jps`, you should see the DataNode process correctly displayed along with the remaining Hadoop processes:

<p align="center">
  <img src="images/datanode-jps.png" width=600>
</p>

_Note: Another possible error you may encounter is the `incompatible cluster ID error`.  To resolve this, delete both the `datanode` and `namenode` folders and create them again._
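
Before deleting anything, you may want to confirm that the cluster IDs really differ. Each storage directory keeps a `current/VERSION` file containing a `clusterID=...` line, and the sketch below compares the two. `cluster_id` is a hypothetical helper, and the paths assume the `dfsdata` folders created earlier in this tutorial.

```shell
# Hypothetical helper: print the clusterID recorded in a VERSION file
# (prints nothing if the file is missing).
cluster_id() {
  sed -n 's/^clusterID=//p' "$1" 2>/dev/null || true
}

data_root="${HADOOP_HOME:-/usr/local/hadoop/hadoop-2.10.1}/dfsdata"
nn_id=$(cluster_id "$data_root/namenode/current/VERSION")
dn_id=$(cluster_id "$data_root/datanode/current/VERSION")

if [ -z "$nn_id" ] || [ -z "$dn_id" ]; then
  echo "VERSION file not found for one or both directories"
elif [ "$nn_id" = "$dn_id" ]; then
  echo "cluster IDs match: $nn_id"
else
  echo "cluster ID mismatch: namenode=$nn_id datanode=$dn_id"
fi
```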

_Note: If the `NameNode` process doesn't appear in the list of processes, you may have to format it first before starting the HDFS and YARN services. Formatting erases any existing HDFS metadata, so only do this on a fresh installation. To do this, run the below command:_

In [None]:
# Format the NameNode
hdfs namenode -format

#### 19. Use your preferred browser and navigate to your localhost URL or IP. The Hadoop NameNode user interface, which allows you to monitor the Hadoop environment, uses port 50070 by default in Hadoop 2.x (including the 2.10.1 release used here) and port 9870 in Hadoop 3.x:

In [None]:
# Hadoop 2.x (use port 9870 on Hadoop 3.x)
http://localhost:50070

You should be able to see a page similar to this one:

<p align="center">
  <img src="images/hadoop-ui.png" width=600>
</p>

Similarly, you can access the DataNode user interface on port 50075 in Hadoop 2.x (port 9864 in Hadoop 3.x):

In [None]:
# Hadoop 2.x (use port 9864 on Hadoop 3.x)
http://localhost:50075

And port 8088 can be used to view the YARN Resource Manager:

In [None]:
http://localhost:8088
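
If a page doesn't load, it's worth checking whether anything is listening on the port at all. The sketch below maps a Hadoop version to the default NameNode and DataNode UI ports as I understand them (50070 and 50075 for 2.x, 9870 and 9864 for 3.x) and probes each URL with `curl`. Both `default_ui_ports` and `probe` are hypothetical helpers, and the ports will differ if you've overridden them in your configuration.

```shell
# Hypothetical helper: default NameNode and DataNode web UI ports
# for a given Hadoop version string.
default_ui_ports() {
  case "$1" in
    2.*) echo "50070 50075" ;;
    3.*) echo "9870 9864" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

# Hypothetical helper: print the HTTP status of a URL, if curl is available.
probe() {
  if command -v curl >/dev/null 2>&1; then
    code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1") || code="unreachable"
    echo "$1 -> $code"
  fi
}

# 8088 is the YARN Resource Manager port mentioned above
for port in $(default_ui_ports "2.10.1") 8088; do
  probe "http://localhost:$port"
done
```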

Now you have Hadoop installed and ready to go!

If you face persistent bugs that you can't resolve, check the Hadoop version you used and try another one, as many bugs are due to compatibility issues between the Hadoop version, the Java version and the operating system you are using. Different operating systems can cause different errors to occur.