+ Can I use LZO compression?
Sure! Just add the --lzo flag to your import like so:
$> zohmg import mapper.py data/ --lzo
+ Can I pass any argument to Dumbo and/or Hadoop Streaming?
Sure! Just append the arguments of your choice like so:
$> zohmg import mapper.py data/ -hadoopconf someoption=yeah
The Dumbo wiki has more information on what options other than
hadoopconf are supported: http://wiki.github.com/klbostee/dumbo/running-programs
+ My mapper raises an ImportError. What's wrong?
The mapper fails to import a module. Make sure you put the eggs for
all non-standard python modules that your mapper use in the lib
directory of your application.
+ What does this exception mean and how do I fix it?
java.lang.NullPointerException
at org.apache.hadoop.io.BytesWritable.(BytesWritable.java:50)
at org.apache.hadoop.typedbytes.TypedBytesWritable.(TypedBytesWritable.java:41)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.getLastOutput(TypedBytesOutputReader.java:73)
<snip>
Check the stdout and stderr in the Hadoop webui for more information. It could be:
* That your mapper outputs all the dimensions you have specified in your dataset.yaml file.
* That there's a python library missing, check the DEPENDENCIES file for more information.
+ What does this error mean?
TypeError: unsupported operand type(s) for +: 'int' and 'str'
The units you output have to be of type int. In this case it's a string instead,
you can typecast it like so: int(variable)
+ What does this error mean?
TypeError: sequence item 0: expected string, int found
The dimension attributes must be strings. In this case it's an int, so
you would have to typecast it like so: str(variable).
+ How can I speed up my mapper?
If you have the native libs built, enabling compression of map output may speed things up.
Set the flag like so: zohmg import <mapper> <input> -jobconf mapred.compress.map.output=true
+ How do I bundle python modules that my code needs?
Make an egg out of the module with easy_install and put that egg in the lib directory:
$> easy_install -zmaxd . yrfavoritemodule
$> cp yrfavoritemodule-0.1.egg ~/zohmg/lib/egg
+ Why am I getting 'OSError: Permission denied' when running an import?
It could be that you do not have write permissions to
'lib/usermapper.py'. Make sure the user you're running the job as has
sufficient permissions.
+ I get this error when installing:
build/temp.linux-i686-2.5/check_libyaml.c:2:18: error: yaml.h: No such file or directory
etc. Should I worry?
There's likely no reason to worry. PyYaml emits this error when it
tries (and fails) to build a C version of the yaml code. If you can
'import yaml' at the python interpreter you're golden.
+ I get this error when installing:
zipimport.ZipImportError: bad local file header in/usr/lib/python2.5/site-packages/zohmg-0.0.1-py2.5.egg
What's wrong?
That happens for no good reason when re-installing zohmg. Running the
installation again always seems to fix it (in fact, install.py already
tried a second time and is likely to have suceeded).
+ I get this error when importing: ImportError: cannot import name map
It seems like your mapper python file does not define a map() function.
+ My imports runs just fine locally but seem to do the wrong thing when I run
them on the cluster. What's up?
Make sure you don't have any old eggs laying around on the cluster node. Zohmg
ships everything it needs, so there should be no need to have the zohmg egg, for
example, installed on the nodes.
+ I see this on Hadoop 0.20 and HBase 0.20/trunk:
2009-06-16 14:36:40,464 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher
That's funny, it looks as if the Zookeeper jar is not on the classpath. Make sure it is.
+ I see this on Hadoop 0.20 and HBase 0.20/trunk:
2009-06-16 14:48:26,495 FATAL org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Fail to read properties from zoo.cfg
java.io.IOException: zoo.cfg not found
You need to have zoo.cfg in Hadoop's conf/ directory.
$> cp -v $HBASE_HOME/conf/zoo.cfg $HADOOP_HOME/conf