
What if you don't have access to a Hadoop cluster? Does that mean you can't use Thrax? Of course not! (It'll just take a lot longer.) Hadoop can run in standalone mode on a single computer. You might say that defeats the purpose of using Hadoop in the first place, but it does give you some nice things for free, like sorting records on disk. So let's get started setting up a standalone Hadoop install.

For a happy medium between standalone and a full cluster, see pseudo-distributed Hadoop.

1. Get Hadoop

It's quite easy to get everything you need in one tarball. Here's one link:

wget http://apache.cs.utah.edu//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

More generally, the official Hadoop Common page has links to all recent versions. Once you have the tarball, you can use

tar -xzf hadoop-0.20.2.tar.gz

to unpack it.
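
Before moving on, it can't hurt to sanity-check the unpacked install by asking it for its version. A quick sketch, assuming you unpacked into the current directory (adjust the path otherwise):

cd hadoop-0.20.2
bin/hadoop version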

2. Configure

The short version: set the three properties listed below to directories where you have lots of free hard disk space.

Hadoop is ready to run in standalone mode essentially as soon as you unpack it. But since standalone mode is really meant for small-scale testing, not for production use, you have to make some configuration changes if you want to use it on a large dataset. The file you need to change is $HADOOP/conf/mapred-site.xml. Here's how mine looks:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/jonny/mapred/tmp</value>
    </property>

    <property>
        <name>mapred.local.dir</name>
        <value>/Users/jonny/mapred/local</value>
        <final>true</final>
    </property>

    <property>
        <name>mapred.system.dir</name>
        <value>/Users/jonny/mapred/system</value>
        <final>true</final>
    </property>
</configuration>
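
Hadoop will generally create these directories on demand, but it doesn't hurt to create them up front and make sure they're writable. A sketch using my paths from the config above; substitute your own:

mkdir -p /Users/jonny/mapred/tmp /Users/jonny/mapred/local /Users/jonny/mapred/system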

As you can see, I added three properties:

  • hadoop.tmp.dir is the base directory where temporary Hadoop files are stored during a job.
  • mapred.local.dir is where intermediate files should be stored during a job. These are things like output from map tasks, chunks of data during the shuffle, etc.
  • mapred.system.dir is where shared files are stored during a job.

The default for all three settings is somewhere under the local /tmp. The problem is that whatever partition /tmp is on is almost certainly not big enough to hold all the intermediate data of a normal-sized Hadoop run. That's why I point those values at places where I know I have a lot of disk space.
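
With the properties in place, a quick smoke test is to run one of the example jobs that ships in the tarball on a small input. This is just a sketch along the lines of the standalone quickstart in Hadoop's own docs; input and output here are placeholder directories, and output must not exist before the run:

cd hadoop-0.20.2
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
cat output/*

If that prints a few matches, your standalone setup works and you're ready to point Thrax at your real data.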