
# Getting started with HDFS

## Requirements

Hadoop Intro Dev with 1 NameNode and at least 3 DataNodes

## Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a data store that can hold general-purpose data. In many ways, HDFS is similar to a Unix-based file system (e.g. Linux, Mac OS X), which has commands like `ls`, `mkdir`, `cd`, `rm`, etc. The key difference is that HDFS distributes data across multiple machines for several reasons:
  • Horizontal scaling: As the data scales larger, more and more machines are required to hold it. HDFS is specifically designed to handle adding machines to the cluster, and it scales linearly - if you double the number of DataNodes in the cluster, you can roughly double the storage capacity.

  • Fault-tolerance: As more and more machines are added to the cluster, it becomes more and more likely that one of them fails, or is momentarily unavailable due to a network partition. To account for this, Hadoop splits files into “blocks” and replicates each block onto several DataNodes, which ensures that a copy of every block remains available even when a machine goes down.

  • Parallel processing: Since the data is distributed in blocks across several machines, the data can be processed by multiple machines at once with a parallel processing framework like MapReduce.

HDFS is a higher-level abstraction that allows us to ignore the underlying details of blocks, replication, etc. and is relatively simple to use. Still, it’s important to understand the basic concepts of the architecture in order to troubleshoot and optimize the design of your data system.

Table of Contents

  1. Requirements
  2. Hadoop Distributed File System
  3. HDFS Architecture
     3.1 Getting Started with a small file
     3.2 Generate and examine large files
  4. Best Practices with HDFS
     4.1 Avoiding small files
     4.2 Never delete raw data
     4.3 Vertical Partitioning
  5. Alternatives to HDFS
     5.1 Amazon S3
     5.2 PFS
     5.3 Tachyon
     5.4 Pail
     5.5 Apache Ignite

## HDFS Architecture

HDFS uses a centralized setup with several “worker” DataNodes that store all the data, and one “master” NameNode that stores the metadata about which DataNodes hold what data. Each file is split into blocks (128MB chunks by default), and a copy of each block is stored on multiple machines (3 copies by default). The NameNode also ensures that the DataNodes are healthy by waiting for them to check in with a “heartbeat” periodically (every 3 seconds by default). If one of the DataNodes is down for too long, its blocks are re-replicated from the remaining copies onto other DataNodes.
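
All of these defaults can be changed in hdfs-site.xml. As a rough sketch (the property names are the standard Hadoop 2.x keys, and the values below simply restate the defaults described above):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>      <!-- block size in bytes (128 MB) -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>              <!-- number of copies kept of each block -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>              <!-- DataNode heartbeat interval in seconds -->
</property>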

For example, a 200MB file would be split into two blocks (let’s call them A and B), and each block would then be copied onto 3 different DataNodes. HDFS is relatively smart about how it distributes the blocks, but the details aren’t important at this level.

When a user first wants to access a block of data, the NameNode “introduces” them to the relevant DataNode by providing the metadata. Afterwards, the user communicates directly with the DataNode until the block is retrieved or the DataNode becomes unavailable. With three copies, the chance that all three copies of a given block are lost is incredibly small. However, the NameNode is still a single point of failure and, in production, would need to be a higher-quality machine with very frequent backups of the metadata.

To get started, ssh into the NameNode and start up HDFS, YARN, and the MapReduce JobHistory server (which was shown in the Hadoop Intro Dev).
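
If those daemons aren’t already running, the start-up looks roughly like this (a sketch assuming the standard Hadoop 2.x scripts under $HADOOP_HOME/sbin; your setup from the Hadoop Intro Dev may differ slightly):

namenode$ $HADOOP_HOME/sbin/start-dfs.sh
namenode$ $HADOOP_HOME/sbin/start-yarn.sh
namenode$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
namenode$ jps    # should list NameNode, ResourceManager, and JobHistoryServer, among others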

## Getting Started with a small file
Next, create an example directory and a file named Who.txt using an editor

namenode$ mkdir ~/hadoop-examples
namenode$ nano ~/hadoop-examples/Who.txt

and copy the following text into it:

So call a big meeting
Get everyone out out
Make every Who holler
Make every Who shout shout

At this point, the file is only on the local Linux File System (LFS).
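
If you’d rather not open an editor, the same file can be created directly from the shell with a heredoc (plain Linux, nothing HDFS-specific):

namenode$ mkdir -p ~/hadoop-examples
namenode$ cat > ~/hadoop-examples/Who.txt << 'EOF'
So call a big meeting
Get everyone out out
Make every Who holler
Make every Who shout shout
EOF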

Make a new directory on the HDFS and copy the file to it with:

namenode$ hdfs dfs -mkdir /examples
namenode$ hdfs dfs -copyFromLocal ~/hadoop-examples/Who.txt /examples/Who.txt
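
You can check that the copy landed in HDFS by listing the new directory:

namenode$ hdfs dfs -ls /examples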

You can also view this by going to the WebUI at namenode-public-dns:50070 and using File Browser in the Utilities toolbar.

You can also view the contents from the terminal using:

namenode$ hdfs dfs -cat /examples/Who.txt

Alternatively, you could use -tail instead of -cat if the file were large. You can see the metadata for the current state of the file system by exploring the directory specified in hdfs-site.xml:

namenode$ ls $HADOOP_HOME/hadoop_data/hdfs/namenode/current

and you can view statistics on the file using

namenode$ hdfs fsck /examples/Who.txt  -files -blocks -locations

The NameNode keeps file system images (fsimages) that track metadata like file permissions and data block locations. These images are stored in a mostly binary format that is fast for the NameNode to process in the event that it needs to be restored quickly. When changes are made, they are stored in edit logs and occasionally combined with the original fsimage file to create a new one, in a process known as checkpointing. Normally, the main purpose of the secondary NameNode is to help with this checkpointing process to speed up recovery if the NameNode fails, but for our purposes the secondary and primary NameNodes are on the same physical machine.
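
If you’re curious what an fsimage actually contains, Hadoop ships an Offline Image Viewer (hdfs oiv) that can dump it to XML. A rough sketch (the exact fsimage file name will differ on your system):

namenode$ cd $HADOOP_HOME/hadoop_data/hdfs/namenode/current
namenode$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o /tmp/fsimage.xml   # substitute your own fsimage file name
namenode$ less /tmp/fsimage.xml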

Note: Hadoop 2.0 supports a “Highly Available” feature where a “Passive NameNode” can be on stand-by in case the “Active NameNode” becomes unavailable. This should NOT be confused with the poorly named secondary NameNode, which helps with checkpointing rather than backing up the NameNode.

You can also view the actual blocks of data on the underlying LFS by ssh’ing into one of the DataNodes and looking in the

$HADOOP_HOME/hadoop_data/hdfs/datanode/current

directory. The block names will be different on your system, but an example command to view the data would be

datanode$ less /usr/local/hadoop/hadoop_data/hdfs/datanode/current/BP-565032067-172.31.26.77-1432623593311/current/finalized/subdir0/subdir0/blk_1073741825

Use TAB to auto-complete the file until you find it.

When it comes to large files, HDFS automatically splits them into blocks of the default size - we’ll see this with a quick example. Rather than downloading big files, we can use a built-in random data generator known as TeraGen. This is included in the example MapReduce jar, along with TeraSort and TeraValidate, as a common benchmarking tool.

## Generate and examine large files

Generate 10,000,000 rows of 100 random bytes each (~1GB) in a new sub-directory of /examples/ called generated_data with the following command:

namenode$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /examples/generated_data

The resulting output should be named part-m-00000 and part-m-00001 - you can see a small sample of the output using the tail command

namenode$ hdfs dfs -tail /examples/generated_data/part-m-00000

If you return to the WebUI file browser and look in the /examples/generated_data directory, you can click on each part file of the output and see its blocks. Each part file is 500,000,000 bytes (half of the ~1GB total), so Blocks 0-2 should have 134,217,728 bytes each (i.e. 128*2^20 B, or 128 MB) and Block 3 should have the remaining 97,346,816 bytes.
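
You can confirm the same block layout from the terminal with the fsck command used earlier:

namenode$ hdfs fsck /examples/generated_data -files -blocks -locations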

Now let’s free up the space on your system by removing this file and directory with:

namenode$ hdfs dfs -rm /examples/generated_data/*
namenode$ hdfs dfs -rmdir /examples/generated_data

Note that you’ll need to remove the files in the directory before you remove the directory. These are just a few examples of commands, but most of them are fairly intuitive and you can see the general usage by entering:

namenode$ hdfs dfs
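
For example, -rm also takes a recursive flag, so the two-step removal above could have been done in a single command:

namenode$ hdfs dfs -rm -r /examples/generated_data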

There are also many configuration values that can be tinkered with; the defaults are listed in hdfs-default.xml.
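
For instance, you can check the values your cluster is actually using with hdfs getconf:

namenode$ hdfs getconf -confKey dfs.blocksize     # 134217728 by default
namenode$ hdfs getconf -confKey dfs.replication   # 3 by default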

Note: You may have noticed that you can also use the command hadoop fs rather than hdfs dfs in the above commands. In the past, every single command in Hadoop started with hadoop, which became burdensome as the number of commands grew. For this reason, some newer versions of Hadoop have deprecated this usage. Similar changes have happened with mapred and mapreduce. Consequently, many tutorials and documentation pages will use the different versions.

## Best Practices with HDFS

There are a few best practices that aren’t strictly enforced while using HDFS, but will help you optimize your usage.

### Avoiding small files

Every file in HDFS has to have its own metadata stored by the NameNode. This makes it inefficient to store lots of small files rather than a few big ones. At the same time, Hadoop needs to be able to split files apart into blocks in order to process them in parallel.

To avoid the small file inefficiency, files are usually compressed together before being “ingested” into HDFS, but not all compression formats work well with Hadoop. LZO compression allows blocks to be split up for MapReduce, but cannot be distributed with Hadoop due to its GNU General Public License. Google’s Snappy compression isn’t inherently splittable but works with container formats like SequenceFiles and Avro. Ingestion is an important topic that deserves its own Dev; check out the first two chapters of Hadoop Applications Architecture (in the Dropbox) for a good overview.
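
As a toy illustration of compressing files together before ingest (plain tar/gzip here, which is simple but not splittable; in practice a container format like SequenceFiles or Avro is usually a better choice, and the directory and file names below are just hypothetical):

namenode$ tar -czf logs-2015-01.tar.gz -C ~/small-logs .     # bundle many small local files into one archive
namenode$ hdfs dfs -mkdir -p /data/ingest
namenode$ hdfs dfs -copyFromLocal logs-2015-01.tar.gz /data/ingest/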

### Never delete raw data

In the examples above we deleted files only because they were trivial. In practice, disk space on a real cluster is getting cheaper and cheaper, but the cost of accidentally deleting a file you need remains high. Even data with a few mistakes may be useful later for troubleshooting. Thus it’s generally a best practice to hold on to all data, especially raw data, and just move undesired data into a discard directory until you’re absolutely sure you don’t need it or you need more space.
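
In practice that usually means reaching for -mv rather than -rm (a sketch; the directory names here are hypothetical):

namenode$ hdfs dfs -mkdir -p /data/discard
namenode$ hdfs dfs -mv /data/tweets/bad-import /data/discard/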

### Vertical Partitioning

Hadoop typically reads and writes to directories rather than individual files (as we saw for the generated data example). For this reason, storing all your data in one big directory is a bad practice. Instead you should create directories for each type of data you have. For example, you might store the tweets of all your users in a directory named /data/tweets and the age of the users in another directory named /data/age. You can further separate the tweets by month with directories like /data/tweets/01-2015 for January, /data/tweets/02-2015 for February, and so on, which allows you to process only the time periods you’re interested in. You can read more about vertical partitioning in Sections 4.5 and 4.6 (p. 62-63) of Nathan Marz’s Big Data (in the library and Dropbox).
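
A minimal sketch of that layout, using the directory names from the example above:

namenode$ hdfs dfs -mkdir -p /data/tweets/01-2015
namenode$ hdfs dfs -mkdir -p /data/tweets/02-2015
namenode$ hdfs dfs -mkdir -p /data/age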

## Alternatives to HDFS

There are many alternatives to HDFS, but most of them offer much the same functionality and they all have similar underlying architectures:

### Amazon S3

Amazon Simple Storage Service (S3) is Amazon’s proprietary alternative to HDFS on AWS. S3 is fairly easy to use and integrates well with other AWS services like EMR, Kinesis, and Redshift. We’ll work more with S3 later when we learn those tools.

### PFS

[The Pachyderm File System](http://www.pachyderm.io/) (PFS) is a new dockerized version of HDFS that is lightweight for simple deployment. We will be learning more about Pachyderm and Docker on Wednesday of Week 1.

### Tachyon

[Tachyon](http://www.alluxio.org/) is a newer distributed file system optimized for memory rather than hard disk, from the same Berkeley group (AMPLab) that developed Spark and Mesos. Tachyon is compatible with popular tools like Hadoop MapReduce and Spark, and claims to be much faster.

### Pail

[Pail](https://github.com/nathanmarz/dfs-datastores) is a higher-level abstraction that runs on top of HDFS to enforce many of the best practices discussed above, like vertical partitioning, combining small files, and data serialization. You can learn more about Pail in Chapter 5 of Big Data (p. 65-82).

### Apache Ignite

[Apache Ignite](http://ignite.apache.org/) is a brand new in-memory platform that offers many of the same advantages as Tachyon and claims to accelerate all types of data processing. At this point, Ignite is bleeding edge but offers some promising advantages for the future.