<DIV ALIGN=CENTER>

# Introduction to Hadoop
## Professor Robert J. Brunner
  
</DIV>  
----- 
-----

## Introduction

In this Notebook, we will introduce Hadoop and the Hadoop Distributed
File System, which underlie the entire Hadoop ecosystem. Our setup will
be using a single Hadoop node, which will not be very fast, especially
when compared to simply running Python code directly. However, even with
this simple setup, the full Hadoop process will be demonstrated,
including the use of the Hadoop file system (HDFS) and the Hadoop
Streaming process model.

Typically, Hadoop is operated on a large cluster that runs both Hadoop
and HDFS, although with the development of YARN, more diverse workflows
are now possible. In this Notebook, we only explore the basic
components, Hadoop and HDFS, which work together to run code on the
nodes that hold the relevant data in order to maximize throughput.
[Other resources][hort] exist to learn more about YARN and other Hadoop
workflows, and we will explore using [Pig][nb3] this week, and using
Spark in Week 14.

In the first code cell, we simply test that Hadoop is working in our
course [Docker container][info-hadoop]. We do this by displaying the
contents of the most recent Hadoop log files. If the files don't exist,
or do not have current timestamps, you will need to explore whether
Hadoop is working correctly before proceeding through the rest of this
Notebook.
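
The timestamp check described above can also be done from Python. The
following is a minimal sketch, not part of the course container: the
helper names `newest_file` and `is_recent` and the one-hour threshold are
assumptions chosen for illustration, while the log location is the one
used in this Notebook's Docker container.

```python
import glob
import os
import time


def newest_file(pattern):
    """Return the most recently modified file matching pattern, or None."""
    matches = glob.glob(pattern)
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)


def is_recent(path, max_age_seconds=3600, now=None):
    """Return True if path was modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds


if __name__ == "__main__":
    # Log location used by this Notebook's container; adjust for other installs.
    log = newest_file("/usr/local/hadoop/logs/hadoop-data*.log")
    if log is None:
        print("No Hadoop log files found; Hadoop may not have been started.")
    else:
        status = "current" if is_recent(log) else "stale"
        print(log, "looks", status)
```

A missing or stale log file is only a hint that Hadoop is not running; the
log contents still need to be read to find out why.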

-----
[hort]: http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
[wcp]: https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

[nb3]: intro2pig.ipynb

[info-hadoop]: https://github.com/UI-DataScience/docker-info490/tree/master/hadoop

In [1]:
%%bash
#!/usr/bin/env bash

echo '##### Out File #####'
# ls -t sorts by modification time, newest first
out_file=$(ls -t /usr/local/hadoop/logs/hadoop-data*.out | head -1)
cat "$out_file"

echo
echo '##### Log File #####'
log_file=$(ls -t /usr/local/hadoop/logs/hadoop-data*.log | head -1)
tail -10 "$log_file"

##### Out File #####
ulimit -a for user data_scientist
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15504
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

##### Log File #####
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1084)
	at org.a

-----

## Setup Local Hadoop Environment


-----

In [2]:
%%bash

# make sure we stop the namenode and datanodes if there are any running from previous run
$HADOOP_PREFIX/sbin/stop-dfs.sh
$HADOOP_PREFIX/sbin/stop-yarn.sh

# Clean up temp files if there are any created during the previous Hadoop operation.
rm -rf /tmp/*

# Format the namenode and delete all files in our HDFS.
echo "Y" | $HADOOP_PREFIX/bin/hdfs namenode -format 2> /dev/null

Stopping namenodes on [6b90ae080522]
6b90ae080522: no namenode to stop
localhost: no datanode to stop
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop


In [3]:
%%bash

# Restart namenode and datanodes
# Source (rather than execute) the environment script so its settings apply to this shell
. $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh

Starting namenodes on [6b90ae080522]
6b90ae080522: starting namenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-namenode-6b90ae080522.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-datanode-6b90ae080522.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-secondarynamenode-6b90ae080522.out


In [4]:
%%bash

# Sometimes when the namenode is restarted, it enters Safe Mode,
# not allowing any changes to the file system. 
# We do want to make changes, so we manually leave Safe Mode.

$HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave

$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /user/$NB_USER

Safe mode is OFF


-----

## HDFS

Next, we need to move the data we want to process into the Hadoop
Distributed File System, or HDFS. HDFS is a file system that is designed
to work effectively with the Hadoop environment. In a typical Hadoop
cluster, files would be broken up and distributed to different Hadoop
nodes. The processing is moved to the data in this model, which can
produce high throughput, especially for map/reduce programming tasks.
However, this means you cannot simply move around the HDFS file system
in the same manner as a traditional Unix file system, since the
components of a particular file are not all collocated. Instead, we must
use the [HDFS file system interface][hdfs], which is invoked by using
`$HADOOP_PREFIX/bin/hdfs`. Running this command by itself in our Hadoop
Docker container will list the available commands, as shown in the
following code cell.
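
To make the idea of breaking a file up across nodes concrete, here is a
toy sketch in Python. It is not how HDFS is implemented: the function
names are hypothetical, the round-robin placement is a simplifying
assumption (real HDFS uses a rack-aware policy and replicates each
block), and the 128 MB block size is the Hadoop 2 default for
`dfs.blocksize`.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # Hadoop 2 default dfs.blocksize, in bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs covering file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Toy placement: hand block i to node i modulo the number of nodes."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement


if __name__ == "__main__":
    # A hypothetical 300 MB file on a hypothetical three-node cluster.
    blocks = split_into_blocks(300 * 1024 * 1024)
    placement = assign_round_robin(blocks, ["node1", "node2", "node3"])
    for node, held in placement.items():
        print(node, held)
```

Because the blocks of a single file live on different nodes, no one
directory on any one machine holds the whole file, which is why ordinary
commands such as `ls` cannot see it.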

-----

[hdfs]: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfs



In [5]:
!$HADOOP_PREFIX/bin/hdfs

Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  classpath            prints the classpath
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
  oiv                  apply the offline fsimage viewer to an fsimage


-----

The standard command we will use is `dfs`, which runs a file system
command on HDFS. The [list of supported `dfs` commands][dfsl] is
extensive, and mirrors many of the traditional Unix file system
commands. The full listing can be obtained by entering
`$HADOOP_PREFIX/bin/hdfs dfs` at the prompt in our Hadoop Docker
container, as shown below.

-----

[dfsl]: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

In [6]:
!$HADOOP_PREFIX/bin/hdfs dfs

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] 

-----

Some of the more useful commands for this class include:

- `-cat`: copies the contents of the source path to STDOUT.

- `-count -h`: counts the number of directories, files, and bytes under
the path specified. With the `-h` flag, the output is displayed in a
human-readable format.

- `-expunge`: empties the trash. When the trash feature is enabled,
files and directories removed with the `rm` command are not deleted
immediately; they are simply moved to the trash. Emptying the trash can
be useful when HDFS supplies a `Name node is in safe mode.` message.

- `-ls`: lists the contents of the indicated directory in HDFS.

- `-mkdir -p`: creates a new directory in HDFS at the specified
location. With the `-p` flag any parent directory specified in the full
path will also be created as necessary.

- `-put`: copies indicated file(s) from local host file system into the
specified path in HDFS.

- `-rm -f -r`: deletes the indicated file or directory. With the `-r`
flag, the command will also delete any files or directories under the
indicated directory, and with the `-f` flag it will not display a
message if the indicated file does not exist. The `-skipTrash` flag
should be used to delete the indicated resource immediately.

- `-tail`: displays the last kilobyte of the indicated file to STDOUT.
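
These commands can also be driven from Python rather than through `!` or
`%%bash` cells. The sketch below is one possible wrapper, assuming the
`HADOOP_PREFIX` environment variable is set as in the course container;
`hdfs_dfs_command` and `run_hdfs_dfs` are hypothetical helper names, not
part of Hadoop, and `demo/in` is a made-up directory.

```python
import os
import subprocess


def hdfs_dfs_command(*args, hadoop_prefix=None):
    """Build the argument list for `$HADOOP_PREFIX/bin/hdfs dfs <args>`."""
    prefix = hadoop_prefix or os.environ.get("HADOOP_PREFIX", "/usr/local/hadoop")
    return [os.path.join(prefix, "bin", "hdfs"), "dfs", *args]


def run_hdfs_dfs(*args, hadoop_prefix=None):
    """Run an `hdfs dfs` command and return its standard output as text."""
    cmd = hdfs_dfs_command(*args, hadoop_prefix=hadoop_prefix)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True,
                            universal_newlines=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here; call run_hdfs_dfs(...) on a running cluster.
    print(hdfs_dfs_command("-mkdir", "-p", "demo/in"))
    print(hdfs_dfs_command("-ls", "demo"))
```

Passing the arguments as a list avoids shell quoting problems when a path
contains spaces or wildcard characters.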

-----

In [7]:
# Display HDFS root directory structure
!$HADOOP_PREFIX/bin/hdfs dfs -ls /

Found 2 items
drwxrwx---   - data_scientist supergroup          0 2016-11-01 20:52 /tmp
drwxr-xr-x   - data_scientist supergroup          0 2016-10-31 21:09 /user


In [8]:
# Display HDFS directory
!$HADOOP_PREFIX/bin/hdfs dfs -ls /user/

Found 1 items
drwxr-xr-x   - data_scientist supergroup          0 2016-11-01 21:54 /user/data_scientist


In [9]:
# Not a local filesystem directory so we get an error
!ls /user

ls: cannot access /user: No such file or directory


In [10]:
# Free Space
!$HADOOP_PREFIX/bin/hdfs dfs -df -h

Filesystem                  Size   Used  Available  Use%
hdfs://6b90ae080522:9000  18.2 G  784 K    970.6 M    0%


-----

### Hadoop Example

We can now turn to a complete Hadoop example. This example will run the
`grep` command over a set of input files to search for the occurrences of
a particular regular expression, which in this case is the three
character sequence _dfs_ followed by one or more lowercase letters or
periods. Hadoop tasks read input from the HDFS file system and write
their output to the HDFS file system. Thus, we need to create an input
directory and move our data to this input directory before we can
execute our Hadoop task.
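
To see what this regular expression selects, the following sketch
reproduces the matching and counting logic locally in Python, outside
Hadoop. It illustrates the idea and is not the code of the Hadoop
example program; the function name `count_matches` and the sample lines
are made up for this purpose.

```python
import re
from collections import Counter

# Same pattern as the Hadoop example: "dfs" followed by one or more
# lowercase letters or periods.
PATTERN = re.compile(r"dfs[a-z.]+")


def count_matches(lines, pattern=PATTERN):
    """Count every occurrence of each distinct string matching pattern."""
    counts = Counter()
    for line in lines:
        # findall plays the role of the map step (emit each match);
        # Counter.update plays the role of the reduce step (sum per match).
        counts.update(pattern.findall(line))
    return counts


if __name__ == "__main__":
    sample = [
        "<name>dfs.replication</name>",
        "dfsadmin and mradmin commands",
        "no match on this line: dfs",
    ]
    for match, n in count_matches(sample).most_common():
        print(n, match)
```

Note that a bare `dfs` at the end of a line is not counted, because the
pattern requires at least one further letter or period.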

As demonstrated in the following code cell, we can do this easily with
HDFS commands, after which we execute our specific Hadoop task. Notice
how we include the input and output directories as part of the task
execution. In some cases these will be indicated by flags. Finally, we
display the directory contents, which demonstrate the successful
completion of this task. Since Hadoop tasks can involve complex
operations over a distributed file system, Hadoop tasks, by default,
display a considerable quantity of information. You can capture or
redirect the STDERR of any Hadoop (or HDFS) command to hide these
informational messages. Of course, this will also hide any error
messages, so proceed carefully if employing this technique.
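
One way to quiet a command without losing its error messages is to
capture STDERR and show it only when the command fails. The sketch below
assumes Python 3.5 or later and uses a hypothetical helper name,
`run_quietly`; it works for any command, not only Hadoop ones, and the
demonstration uses a small Python one-liner as a stand-in for a Hadoop
task.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd and return (stdout, stderr); print stderr only on failure."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        # Surface the captured messages so real errors are not lost.
        print(result.stderr, file=sys.stderr)
    return result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in for a chatty Hadoop command: writes to both streams.
    demo = [sys.executable, "-c",
            "import sys; print('result'); print('INFO noise', file=sys.stderr)"]
    out, err = run_quietly(demo)
    print(out, end="")
```

The captured text is still returned to the caller, so it can be inspected
later even when the command succeeds.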


-----

In [11]:
%%bash

# Example derived from Hadoop single node setup documentation:
#
# https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
#

cd $HADOOP_PREFIX

# Remove old directory (if it exists) to have clean example
bin/hdfs dfs -rm -r -f hadoop

# Make directories for example application
bin/hdfs dfs -mkdir -p hadoop
bin/hdfs dfs -mkdir -p hadoop/input

# Copy data into example input directory
bin/hdfs dfs -put etc/hadoop/*.xml hadoop/input

# Running Hadoop example to test installation
example_file=$(ls share/hadoop/mapreduce/hadoop-mapreduce-examples*)
bin/hadoop jar $example_file grep hadoop/input hadoop/output 'dfs[a-z.]+'

# Display directory hierarchy
echo
echo 'Hadoop Directory'
bin/hdfs dfs -ls hadoop/

echo
echo 'Hadoop Input Directory'
bin/hdfs dfs -ls hadoop/input

echo
echo 'Hadoop Output Directory'
bin/hdfs dfs -ls hadoop/output

Deleted hadoop

Hadoop Directory
Found 1 items
drwxr-xr-x   - data_scientist supergroup          0 2016-11-01 22:25 hadoop/input

Hadoop Input Directory
Found 9 items
-rw-r--r--   1 data_scientist supergroup       4436 2016-11-01 22:25 hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 data_scientist supergroup        158 2016-11-01 22:25 hadoop/input/core-site.xml
-rw-r--r--   1 data_scientist supergroup       9683 2016-11-01 22:25 hadoop/input/hadoop-policy.xml
-rw-r--r--   1 data_scientist supergroup        354 2016-11-01 22:25 hadoop/input/hdfs-site.xml
-rw-r--r--   1 data_scientist supergroup        620 2016-11-01 22:25 hadoop/input/httpfs-site.xml
-rw-r--r--   1 data_scientist supergroup       3518 2016-11-01 22:25 hadoop/input/kms-acls.xml
-rw-r--r--   1 data_scientist supergroup       5511 2016-11-01 22:25 hadoop/input/kms-site.xml
-rw-r--r--   1 data_scientist supergroup        357 2016-11-01 22:25 hadoop/input/mapred-site.xml
-rw-r--r--   1 data_scientist supergroup       170

16/11/01 22:25:19 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
16/11/01 22:25:31 INFO client.RMProxy: Connecting to ResourceManager at 6b90ae080522/172.17.0.1:8032
16/11/01 22:25:32 INFO ipc.Client: Retrying connect to server: 6b90ae080522/172.17.0.1:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/11/01 22:25:33 INFO ipc.Client: Retrying connect to server: 6b90ae080522/172.17.0.1:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/11/01 22:25:34 INFO ipc.Client: Retrying connect to server: 6b90ae080522/172.17.0.1:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
16/11/01 22:25:35 INFO ipc.Client: Retrying connect to server: 6b90ae080522/172.17.0.1:8032. Already tried 3 time(s); re

-----

As shown in the output directory display, several files were created by
our Hadoop task. The first file, `_SUCCESS`, is self-explanatory,
especially when you see the file is empty. The second file,
`part-r-00000`, contains the output of the command. We can display the
contents of this file by using the HDFS `cat` command, as shown in the
following cell.

-----

In [12]:
!$HADOOP_PREFIX/bin/hdfs dfs -cat hadoop/output/part-r-00000

cat: `hadoop/output/part-r-00000': No such file or directory


-----

Of course, we want to know if this output is correct. To test this, we
can use the standard Unix `grep` command to find the expected output,
which shows the same four lines (although the pattern here omits the
trailing `+`, and `grep` displays the whole matching line by default).

-----

In [13]:
!grep --color 'dfs[a-z.]' $HADOOP_PREFIX/etc/hadoop/*.xml

[35m[K/usr/local/hadoop/etc/hadoop/hadoop-policy.xml[m[K[36m[K:[m[K    [01;31m[Kdfsa[m[Kdmin and mradmin commands to refresh the security policy in-effect.
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Kreplication</name>
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Knamenode.rpc-bind-host</name>
[35m[K/usr/local/hadoop/etc/hadoop/hdfs-site.xml[m[K[36m[K:[m[K        <name>[01;31m[Kdfs.[m[Knamenode.servicerpc-bind-host</name>


-----

## Acquiring Data 

In the next two lessons, we will analyze text data by using Hadoop
map-reduce and Pig. As a result, we finish this Notebook by acquiring a
text data set for later analysis, which we stage locally (i.e., outside
HDFS). First, we delete our local directory if it exists and create it
to have a clean install.


-----

In [14]:
%%bash
#!/usr/bin/env bash
# A Bash shell script to delete the Hadoop directory if it exists, after which
# we make a new Hadoop directory

# Our directory name
DIR=$HOME/hadoop

# Delete if exists
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi

# Now make the directory
mkdir "$DIR"

ls -la "$DIR"

total 8
drwxr-xr-x  2 data_scientist users 4096 Nov  1 22:45 .
drwxr-xr-x 23 data_scientist users 4096 Nov  1 22:45 ..


-----

### Acquiring Data

To perform data analysis by using Hadoop, we will need a data set. In the
Notebook for the second lesson this week, we will perform a simple
map/reduce operation that will require text data to operate. While there
are a number of possible options, for this example we can grab a free
book from [Project Gutenberg][pg]:

    wget --output-document=$HOME/hadoop/book.txt \
        http://www.gutenberg.org/cache/epub/4300/pg4300.txt

In this case, we have grabbed the full text of the novel _Ulysses_, by
James Joyce, and placed the text in the `hadoop` subdirectory of our
_home_ directory.
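
The same download can be scripted in Python. The sketch below skips the
download when the file is already present, which avoids repeated requests
to Project Gutenberg; `fetch_once` is a hypothetical helper name, and the
usage shown in the comment assumes the same destination path as the
`wget` command above.

```python
import os
import urllib.request


def fetch_once(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if os.path.exists(dest):
        return dest  # reuse the local copy instead of contacting the server
    directory = os.path.dirname(dest)
    if directory:
        os.makedirs(directory, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


# Example usage (requires network access, so it is left as a comment):
#   book = os.path.join(os.path.expanduser("~"), "hadoop", "book.txt")
#   fetch_once("http://www.gutenberg.org/cache/epub/4300/pg4300.txt", book)
```

Checking for an existing copy first also makes the cell safe to rerun.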

-----
[pg]: http://www.gutenberg.org

In [18]:
# Grab a book to process
!wget --output-document=$HOME/hadoop/book.txt \
http://www.gutenberg.org/files/4300/4300-0.txt

# On the course Jupyter server, we simply copy the text from the data directory,
# since the Gutenberg site might otherwise think the students were launching a
# denial of service attack, as all requests would come from the same site.

#!cp /home/data_scientist/data/book.txt $HOME/hadoop/book.txt

--2016-11-01 23:59:20--  http://www.gutenberg.org/files/4300/4300-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1580914 (1.5M) [text/plain]
Saving to: ‘/home/data_scientist/hadoop/book.txt’


2016-11-01 23:59:21 (1.24 MB/s) - ‘/home/data_scientist/hadoop/book.txt’ saved [1580914/1580914]



-----

At this point, we first need to create a directory to hold the input
and output of our Hadoop task. We will create a new directory called
`wc` with a subdirectory called `in` to hold the input data for our
Hadoop task. Second, we will need to copy the book text file into this
new HDFS directory. This means we will need to run the following two
Hadoop commands:

1. `$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p wc/in`
2. `$HADOOP_PREFIX/bin/hdfs dfs -put book.txt wc/in/book.txt`

The following cell runs these two commands, preceded by a `dfs -rm`
command that removes any existing `wc` directory and followed by a
`dfs -count` command that shows the size of the directory contents. At
the end of this output will be a message from Hadoop, which simply
states that files are being immediately deleted (the trash deletion
interval is zero minutes). This value can be changed to cache files
before deleting for a specific time interval, which would, of course,
allow files to be recovered if accidentally deleted.
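
Without the `-h` flag, `dfs -count` prints four columns: directory count,
file count, content size in bytes, and path name. A small parser makes it
easy to compare the HDFS byte count with the size of the local file. This
is a sketch; `parse_count_line` and `matches_local` are hypothetical
helpers, and the sample line is illustrative rather than captured output.

```python
import os
from collections import namedtuple

CountEntry = namedtuple("CountEntry", "dirs files size_bytes path")


def parse_count_line(line):
    """Parse one line of `hdfs dfs -count` output (run without -h)."""
    dirs, files, size, path = line.split(None, 3)
    return CountEntry(int(dirs), int(files), int(size), path.strip())


def matches_local(entry, local_path):
    """Return True if the HDFS content size equals the local file size."""
    return entry.size_bytes == os.path.getsize(local_path)


if __name__ == "__main__":
    # Illustrative line in the format printed by `hdfs dfs -count wc/in/*`.
    sample = "           0            1            1580914 wc/in/book.txt"
    print(parse_count_line(sample))
```

The parser expects raw byte counts; output produced with `-h` (such as
`644.7 K`) splits the size across two columns and would need different
handling.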

-----

In [16]:
%%bash

cd $HADOOP_PREFIX

bin/hdfs dfs -rm -r -f wc

echo
echo 'Creating input directory, and copying data.'
bin/hdfs dfs -mkdir -p wc/in
bin/hdfs dfs -put $HOME/hadoop/book.txt wc/in/book.txt

echo
echo 'Input directory contents'
bin/hdfs dfs -count -h wc/in/*

Deleted wc

Creating input directory, and copying data.

Input directory contents
           0            1            644.7 K wc/in/book.txt


16/11/01 22:45:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


In [17]:
%%bash

# The namenode and datanodes consume quite a bit of memory while running in the
# background, so we shut them down at the end of the Notebook.

$HADOOP_PREFIX/sbin/stop-dfs.sh
$HADOOP_PREFIX/sbin/stop-yarn.sh

Stopping namenodes on [6b90ae080522]
6b90ae080522: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode


-----

### Student Activity

In the preceding cells, we introduced Hadoop and the HDFS file system.
Now that you have run the Notebook, go back and make the following
changes to see how the results change.

1. Create a new Hadoop HDFS directory, using your own name for the
directory name.
2. Copy one or more local files into your new Hadoop directory. Run a
Hadoop command to display the files and their byte count. Do the results
agree with your local values?
3. Run a different `grep` example on the book you downloaded. Do the
results make sense?

-----