Add vsihdfs Support #714
Conversation
I didn't actually look at your code, but since I see you use Java libraries, my main question is: why not use the REST API exposed at https://hadoop.apache.org/docs/r1.0.4/webhdfs.html, like the other network VSI implementations in GDAL? At first sight the API seems reasonable, and that would greatly simplify the use and deployment of vsihdfs.
I think that the primary motivation is speed/scalability. The REST API does look like a convenient choice for many cases, though.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
VSIHdfsFilesystemHandler::VSIHdfsFilesystemHandler()
{
  this->poFilesystem = hdfsConnect("default", 0);
What is the overhead of this? This constructor will be called at GDAL startup when VSI is initialized, so if there's significant time involved, the connection should be moved to the first time a real access (open, stat, etc.) is done.
Done.
gdal/configure.ac
Outdated
dnl JNI
dnl ---------------------------------------------------------------------------

AC_ARG_WITH(jni,[  --with-jni[=ARG]          Include JVM JNI support (ARG=Path or no)],,)
As you mentioned in the checklist, there's already an option to detect the JVM.
Done.
gdal/configure.ac
Outdated
else
  if test JNI_ENABLED="yes"; then
    AC_MSG_CHECKING([for HDFS in $with_hdfs])
    AC_CHECK_LIB(hdfs,hdfsConnect,HAVE_HDFS_LIB=yes,HAVE_HDFS_LIB=no,$JNI_LIB -L$with_hdfs/lib/native)
So libhdfs is a C-API library that calls Java code?
Yes, that is correct. A small amount of information can be found here; the $HADOOP_PREFIX/include/hdfs.h header is the most comprehensive source of information on libhdfs that I have found.
 * DEALINGS IN THE SOFTWARE.
 ****************************************************************************/

//! @cond Doxygen_Suppress
There should be an empty implementation of VSIInstallHdfsHandler() when HDFS_ENABLED is not defined.
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
vsi_l_offset Length()
{
  hdfsFileInfo * poInfo = hdfsGetPathInfo(this->poFilesystem, this->oFilename.c_str());
  if (poInfo != NULL) {
Please use nullptr everywhere.
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
size_t Write(const void *pBuffer, size_t nSize, size_t nMemb) override;
vsi_l_offset Length()
{
  hdfsFileInfo * poInfo = hdfsGetPathInfo(this->poFilesystem, this->oFilename.c_str());
There are a number of useless this-> occurrences throughout the code. If you want to distinguish local variables from class members, you could use the m_xxxx or xxxx_ convention for class member variables.
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
vsi_l_offset Tell() override;
size_t Read(void *pBuffer, size_t nSize, size_t nMemb) override;
size_t Write(const void *pBuffer, size_t nSize, size_t nMemb) override;
vsi_l_offset Length()
As it is not a one-liner, could you move the implementation of this method outside of the class declaration?
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
int VSIHdfsHandle::Eof()
{
  return (this->Tell() == this->Length());
This is not completely POSIX compliant. If a file is, for example, one byte long, and you read one byte from it and then call Eof(), it should return FALSE. Eof() should only be set to TRUE if you attempt to read beyond the end of the file, and it should be reset as soon as you do a Seek().
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
}
else {
  const char * pszPath = pszFilename + strlen(VSIHDFS);
  hdfsFile poFile = hdfsOpenFile(poFilesystem, pszPath, O_RDONLY, 0, 0, 0);
I assume the hdfs API is thread-safe?
Yes; that claim can be found here. The exact wording used there seems to leave a little room for a negative answer if there is a deficiency within the JNI functionality, but that is not believed to be the case.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
hdfsFileInfo * paoInfo = hdfsListDirectory(this->poFilesystem, pszDirname, &mEntries);
if (paoInfo != NULL) {
  papszNames = new char*[mEntries+1];
Could be simplified a bit by using:

    CPLStringList aosNames;
    for (int i = 0; i < mEntries; ++i)
        aosNames.AddString(paoInfo[i].mName);
    return aosNames.StealList();
Done.
Is there a way to silence the Java exceptions on failed open() attempts? Ideally they should not be verbose.
That is a good question; I will try to find an answer. EDIT: I have addressed this.
Reported by Doxygen
gdal/port/cpl_vsil_hdfs.cpp
Outdated
@@ -209,6 +207,9 @@ VSIHdfsFilesystemHandler::Open( const char *pszFilename,
                                const char *pszAccess,
                                bool)
{
  if (poFilesystem == nullptr)
This should be protected by a mutex, and the same should probably be done in Stat() as well.
Done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
  return 0;
}
else if (bytes < 0)
  return -1;
-1 is invalid for size_t. Should be 0
gdal/port/cpl_vsil_hdfs.cpp
Outdated
return hdfsRead(poFilesystem, poFile, pBuffer, nSize * nMemb);
int bytes = hdfsRead(poFilesystem, poFile, pBuffer, nSize * nMemb);
if (bytes == 0) {
  bEOF = true;
Is bytes == 0 returned if you read partly inside the file and partly beyond it? In that situation you would ideally return the number of members successfully read, and set eof.
I changed Read to have the following behavior: the Read function returns a non-zero value if and only if the read from HDFS succeeded. Given a read request of more than zero bytes, hdfsRead returns zero if and only if EOF has been reached, so the EOF state is set to true in that case. hdfsRead returns -1 if and only if the read failed; in that case Read returns zero with the EOF state cleared.

The hdfsRead function returns zero (for a non-zero requested range) if and only if EOF has been reached:

    @return On success, a positive number indicating how many bytes
            were read.
            On end-of-file, 0.
            On error, -1. Errno will be set to the error code.
            Just like the POSIX read function, hdfsRead will return -1
            and set errno to EINTR if data is temporarily unavailable,
            but we are not yet at the end of the file.

So I think that the behavior should now be correct. errno
is also properly set after this code has run in case that is used anywhere.
 * DEALINGS IN THE SOFTWARE.
 ****************************************************************************/

//! @cond Doxygen_Suppress
This //! @cond Doxygen_Suppress is not properly balanced with the corresponding //! @endcond if you take into account #ifdef / #endif
Ah, okay. I don't have Doxygen installed locally, so I was reduced to guessing. Thank you.
I think that I may have cleaned these issues up, finally.
Doxygen is still complaining since it takes the dummy code path.
You should be able to make it happy by defining HDFS_ENABLED in the PREDEFINED variable of the gdal/Doxyfile file. The Windows build also fails since VSIInstallHdfsHandler() is called but not defined; you should add cpl_vsil_hdfs.obj to port/makefile.vc.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
@@ -260,8 +267,10 @@ VSIHdfsFilesystemHandler::Open( const char *pszFilename,
int
VSIHdfsFilesystemHandler::Stat( const char *pszFilename, VSIStatBufL *pStatBuf, int)
{
  memset(pStatBuf, 0, sizeof(*pStatBuf)); // XXX
  CPLMutexHolder oHolder( &hMutex );
poFilesystem might be uninitialized at that point if Open() has never been called
gdal/port/cpl_vsil_hdfs.cpp
Outdated
@@ -61,6 +61,16 @@ CPL_CVSID("$Id$")
/* ==================================================================== */
/************************************************************************/

#define SILENCE(expr) {\
  int hOldStderr = dup(2);\
In case the process runs short of available file handles, dup() and open() could fail. It would be prudent to check them.
Okay, done.
gdal/port/cpl_vsil_hdfs.cpp
Outdated
@@ -217,6 +217,11 @@ class VSIHdfsFilesystemHandler final : public VSIFilesystemHandler
    VSIHdfsFilesystemHandler();
    ~VSIHdfsFilesystemHandler() override;

    void EnsureFilesystem() {
      if (poFilesystem == nullptr)
        poFilesystem = hdfsConnect("default", 0);
Inconsistent indentation. The implementation would also be best outside of the class declaration, and you could move the CPLMutexHolder into this method, since it is needed just for this lazy instantiation.
Okay, done.
I think that all current comments have been satisfied.
Yep, but there are still missing items from my point of view:
Okay, sounds good, I will get to those as soon as I can.
def vsihdfs_1():
    filename = '/vsihdfs/file:' + os.getcwd() + '/data/text.txt'
    fp = gdal.VSIFOpenL(filename, 'rb')
    if fp is None:
Given that this is an optionally compiled VFS, tests should return 'skip' if the VFS is not present. Unfortunately there's no easy way of knowing whether a VFS is compiled in or not, so presumably you would use that initial failure as a sign that it is missing.
Okay, I have implemented this.
autotest/gcore/vsihdfs.py
Outdated
        return 'fail'

    data = gdal.VSIFReadL(5, 1, fp)
    if not data or data != 'Lorem':
You likely need data.decode('ascii') != 'Lorem' (and in similar following call sites) for Python 3 compatibility.
Okay, done.
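The issue is that under Python 3 the read returns a bytes object, which never compares equal to a str. A tiny self-contained illustration (the literal stands in for the gdal.VSIFReadL result):

```python
# Under Python 3, gdal.VSIFReadL() hands back bytes, so a direct
# comparison with a str silently evaluates to False; decode first.
data = b'Lorem'  # stand-in for what VSIFReadL would return

assert data != 'Lorem'                   # bytes never equal str in Python 3
assert data.decode('ascii') == 'Lorem'   # the portable comparison
```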
A few Pyflakes warnings were raised.
Okay, I think that those have been corrected.
# Read test
def vsihdfs_1():
    filename = '/vsihdfs/file:' + os.getcwd() + '/data/text.txt'
    fp = gdal.VSIFOpenL(filename, 'rb')
You could set gdaltest.have_vsihdfs = False here.
Okay, I have done this.
autotest/gcore/vsihdfs.py
Outdated
    fp = gdal.VSIFOpenL(filename, 'rb')
    if fp is None:
        return 'skip'
And here gdaltest.have_vsihdfs = True; then in the following test steps you would test this flag instead of trying to reopen the file when you don't need to.
Okay, I have done this.
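The suggested pattern, sketched self-contained: the first test records whether the VFS is available in a module-level flag, and later tests consult the flag instead of reopening the file. Here vsi_open is a stand-in for gdal.VSIFOpenL (simulating a build without vsihdfs), and have_vsihdfs mirrors gdaltest.have_vsihdfs.

```python
have_vsihdfs = None

def vsi_open(filename, mode):
    # Stand-in for gdal.VSIFOpenL on a build without the vsihdfs VFS.
    return None

def vsihdfs_1():
    global have_vsihdfs
    fp = vsi_open('/vsihdfs/file:/tmp/text.txt', 'rb')
    if fp is None:
        have_vsihdfs = False  # VFS absent: skip, and remember that
        return 'skip'
    have_vsihdfs = True
    # ... real read assertions would go here ...
    return 'success'

def vsihdfs_2():
    if not have_vsihdfs:      # no need to try opening again
        return 'skip'
    return 'success'

assert vsihdfs_1() == 'skip'
assert vsihdfs_2() == 'skip'
```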
/vsihdfs/ is a file system handler that provides read access to HDFS.
This handler requires GDAL to have been built with Java support
(--with-java) and HDFS support (--with-hdfs).
You could add that only the Unix build is supported currently.
Okay.
Examples:
<pre>
/vsizip/file:/tmp/my.tif (a local file accessed through HDFS)
the examples look wrong
Yes, indeed.
Squashed and merged
Thanks much
import org.gdal.gdal.Dataset;
import org.gdal.gdal.Driver;
import org.gdal.gdal.gdal;
import org.gdal.gdalconst.gdalconstConstants;

public class Tifovr2Jpg {
    public static void main(String[] args) {
        gdal.AllRegister();
        gdal.SetConfigOption("gdal_FILENAME_IS_UTF8", "NO");
        System.out.println("Argument at index 0: " + args[0]);
        Dataset ds = gdal.Open(args[0], gdalconstConstants.GA_ReadOnly);
        System.out.println(ds);
        // Check whether the data source is null
        if (ds == null) {
            System.out.println(" Dataset is null... ");
            System.exit(1);
        }
        Driver driver = ds.GetDriver();
        System.out.println("driver : " + driver);
    }
}

root@Kylin:/home/gdalTest# java Tifovr2Jpg /vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif
Argument at index 0: /vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
hdfsGetPathInfo(hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif): constructNewObjectOfPath error:
(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
ERROR 4: `/vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif' does not exist in the file system, and is not recognized as a supported dataset name.
null
 Dataset is null...

root@Kylin:/home/gdalTest# gdalinfo /vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif
Driver: GTiff/GeoTIFF
Files: /vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif
Size is 11251, 7501
Coordinate System is `'
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 7501.0)
Upper Right (11251.0, 0.0)
Lower Right (11251.0, 7501.0)
Center ( 5625.5, 3750.5)
Band 1 Block=11251x16 Type=Byte, ColorInterp=Red
Band 2 Block=11251x16 Type=Byte, ColorInterp=Green
Band 3 Block=11251x16 Type=Byte, ColorInterp=Blue

I generated files such as gdal.jar through make in the swig/java directory of GDAL and placed them in the ext path of the JDK. When running my Java project, the above error occurred. When accessing the file through gdalinfo, no problem appeared.
Hello, thank you for your question. It is hard to say with precision what is going on, but the fact that the failure is a java.lang.NoClassDefFoundError suggests that the Hadoop classes are not visible to the JVM your program runs in. When you use the gdalinfo binary, the CLASSPATH from your shell environment is what libhdfs sees, so the Hadoop jars are found. Therefore, when you write a Java program that uses the GDAL bindings, you need to make sure the Hadoop classpath is also supplied to that JVM. I might suggest trying the CLASSPATH setup described in this pull request before launching your program.
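One way to try this, based on the CLASSPATH setup described elsewhere in this pull request (HADOOP_HOME is an assumed placeholder for your Hadoop installation; adjust the paths to your machine):

```shell
# Build the Hadoop classpath and export it for libhdfs.
export CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath --glob)

# Pass the same classpath explicitly to the JVM instead of relying on
# the JDK ext directory.
java -cp ".:$CLASSPATH" Tifovr2Jpg \
    /vsihdfs/hdfs://192.168.56.46:8020/home/chdh/rawdata/chdl/image/06490082/20220415000000/06490082.tif
```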
What does this PR do?
This pull request adds (read-only) support for HDFS in the form of a vsihdfs driver.
What are related issues/pull requests?
Tasklist
Environment
Provide environment details, if relevant:
Mostly developed on Linux Mint 17.3 (effectively Ubuntu 14.04), should be fairly portable to any Unix-like system.
To Build
./autogen.sh
in the root of the repository.JAVA_HOME
environment variable. In my case,export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
LD_LIBRARY_PATH
environment variable. In my caseexport LD_LIBRARY_PATH=$HOME/local/hadoop-2.7.6/lib/native:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:$LD_LIBRARY_PATH
. That is necessary for final link steps at the end to succeed (perhaps this could be topic for discussion because my current understanding ofautoconf
is not overwhelming).CLASSPATH
environment variable points to the Hadoop native libraries. In my caseexport CLASSPATH=$($HOME/local/hadoop-2.7.6/bin/hadoop classpath --glob)
. This is not required for building, but is required for running.--with-hdfs
flag. In my case,/configure --prefix=$HOME/local/gdal-master --with-java=$JAVA_HOME --with-hdfs=$HOME/local/hadoop-2.7.6
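The build steps above can be summarized as a shell session (the paths are the ones from my machine; adjust them to your Java and Hadoop installations):

```shell
./autogen.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export LD_LIBRARY_PATH=$HOME/local/hadoop-2.7.6/lib/native:$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH
export CLASSPATH=$($HOME/local/hadoop-2.7.6/bin/hadoop classpath --glob)  # needed at run time
./configure --prefix=$HOME/local/gdal-master --with-java=$JAVA_HOME --with-hdfs=$HOME/local/hadoop-2.7.6
make
```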
To Run
LD_LIBRARY_PATH and CLASSPATH must be as described above.
$HOME/local/gdal-master/bin/gdalinfo /vsihdfs/file:/tmp/TC_NG_Baghdad_IQ_Geo.tif
The above command produces this output for me: