Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement unbuffer interface for HdfsFileInputStream #16017

Merged
merged 1 commit into from
Feb 23, 2023

Conversation

qian0817
Copy link
Contributor

@qian0817 qian0817 commented Aug 5, 2022

What changes are proposed in this pull request?

Implement unbuffer interface for HdfsFileInputStream. Fix #16016.

Why are the changes needed?

If the unbuffer method is not implemented, then impala will not be able to use the file handle cache.

Does this PR introduce any user facing changes?

Implement CanUnbuffer and StreamCapabilities for HdfsFileInputStream.

Copy link
Contributor

@Jackson-Wang-7 Jackson-Wang-7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not familiar with unbuffer. Firstly I want to know how and when does the client call the unbuffer method. Secondly I want to know how it affect the performance and what the optimized benefits are.
And I need to make sure it doesn't affect other scenes, please give me some more information and test examples.
Thanks!

@qian0817
Copy link
Contributor Author

qian0817 commented Dec 5, 2022

@Jackson-Wang-7 Thanks for you review. I rebase the master branch and commit our latest changes.

@Jackson-Wang-7
Copy link
Contributor

Mostly LGTM, and I suggest you’d better find one more person to take a look.

@jiacheliu3
Copy link
Contributor

/ping @jja725 PTAL thanks!

Copy link
Contributor

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. I've discussed this with @beinan and we're good with the new shaded method. Bring in @LuQQiu to see whether they are OK with maintaining two different ways of shaded.

<parent>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-core-client</artifactId>
<version>2.10.0-SNAPSHOT</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Xenorith do we need to update the script for this new pom file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, the script looks for the version string in all pom.xml files:

function update_poms() {
    find . -name pom.xml | xargs -t -n 1 perl -pi -e "s/${1}/${2}/g"
}

<artifactId>alluxio-core-client-hdfs3</artifactId>
<packaging>jar</packaging>
<name>Alluxio Core - Client - HDFS3</name>
<description>HDFS Client of Alluxio Core For HDFS 3</description>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help us update the doc regarding the hdfs3 client

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let's align first internally then comment on this PR on the recommended doc change.

Xenorith added a commit to Xenorith/alluxio that referenced this pull request Jan 3, 2023
the new client/hdfs3 and shaded/client-hadoop3  modules are currently a copy of the existing client/hdfs and shaded/client modules

the addition of this will allow for changes that are available only in hadoop3, such as Alluxio#16017 (comment)

both client jars will be built by default, but the symlink at client/alluxio-VERSION-client.jar will point to the hadoop-2 one to maintain backcompat. if the hadoop-3 profile is activated by adding `-Phadoop-3`, then the symlink will be overridden to point to the new hadoop3 shaded client jar

note that in the current state of this PR, the tarball generation will result in pointing to the hadoop3 shaded client jar because the hadoop-3 profile is activated by default
alluxio-bot pushed a commit that referenced this pull request Jan 18, 2023
the new client/hdfs3 and shaded/client-hadoop3 modules are currently a
copy of the existing client/hdfs and shaded/client modules

the addition of this will allow for changes that are available only in
hadoop3, such as
#16017 (comment)

both client jars will be built by default, but the symlink at
client/alluxio-VERSION-client.jar will point to the hadoop-2 one to
maintain backcompat. if the hadoop-3 profile is activated by adding
`-Phadoop-3`, then the symlink will be overridden to point to the new
hadoop3 shaded client jar

pr-link: #16699
change-id: cid-a6ffd09414e8259078fd5a8f68c3e287d85feec5
@Xenorith
Copy link
Contributor

#16699 has merged to unblock the maven/pom parts of this change

@jiacheliu3
Copy link
Contributor

@qian0817 Do you mind rebasing on latest master? Let's only keep the unbuffer change in this PR, while we rely on the #16699 to build the clients we want. Feel free to let me know if there's any question, thanks!

jja725 pushed a commit to jja725/alluxio that referenced this pull request Jan 27, 2023
the new client/hdfs3 and shaded/client-hadoop3 modules are currently a
copy of the existing client/hdfs and shaded/client modules

the addition of this will allow for changes that are available only in
hadoop3, such as
Alluxio#16017 (comment)

both client jars will be built by default, but the symlink at
client/alluxio-VERSION-client.jar will point to the hadoop-2 one to
maintain backcompat. if the hadoop-3 profile is activated by adding
`-Phadoop-3`, then the symlink will be overridden to point to the new
hadoop3 shaded client jar

pr-link: Alluxio#16699
change-id: cid-a6ffd09414e8259078fd5a8f68c3e287d85feec5
@qian0817
Copy link
Contributor Author

@jiacheliu3 @jja725 @Jackson-Wang-7 @dbw9580 Please revirwe this PR again, thanks!

@Jackson-Wang-7
Copy link
Contributor

Jackson-Wang-7 commented Feb 15, 2023

Is there any new change from my LGTM last time except the pom change?

@qian0817
Copy link
Contributor Author

Is there any new change from my LGTM last time except the pom change?
No change.

Copy link
Contributor

@Jackson-Wang-7 Jackson-Wang-7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jja725
Copy link
Contributor

jja725 commented Feb 15, 2023

Just curious is there any other compute framework use this function except impala?

@qian0817
Copy link
Contributor Author

From HADOOP-14747 we can know that HBase also use the unbffer function

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the work!

Comment on lines +537 to +556
byte[] cacheMiss = new byte[partialReadSize];
stream.unbuffer();
stream.seek(offset);
stream.unbuffer();
Assert.assertEquals(partialReadSize, stream.read(cacheMiss));
stream.unbuffer();
Assert.assertArrayEquals(
Arrays.copyOfRange(testData, offset, offset + partialReadSize), cacheMiss);
Assert.assertEquals(0, manager.mPagesServed);
Assert.assertEquals(1, manager.mPagesCached);

byte[] cacheHit = new byte[partialReadSize];
stream.unbuffer();
stream.seek(offset);
stream.unbuffer();
Assert.assertEquals(partialReadSize, stream.read(cacheHit));
stream.unbuffer();
Assert.assertArrayEquals(
Arrays.copyOfRange(testData, offset, offset + partialReadSize), cacheHit);
Assert.assertEquals(1, manager.mPagesServed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add some more comments in the test explaining the behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is essentially the same as readPartialPage, which is to test that unbffer method does not affect the read behavior.

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jja725 jja725 self-requested a review February 22, 2023 19:54
Copy link
Contributor

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jiacheliu3
Copy link
Contributor

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 903269f into Alluxio:master Feb 23, 2023
@qian0817 qian0817 deleted the unbuffer branch February 23, 2023 02:02
@Xenorith Xenorith added the area-client Alluxio client label Feb 28, 2023
YangchenYe323 pushed a commit to YangchenYe323/alluxio that referenced this pull request Apr 16, 2023
### What changes are proposed in this pull request?

Implement unbuffer interface for HdfsFileInputStream. Fix Alluxio#16016.

### Why are the changes needed?

If the unbuffer method is not implemented, then impala will not be able
to use the file handle cache.

### Does this PR introduce any user facing changes?

Implement CanUnbuffer and StreamCapabilities for HdfsFileInputStream.

pr-link: Alluxio#16017
change-id: cid-b50163c7b4f199b8a61d5818a0e4739039f2745c
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 15, 2023
…ed yet: Add client-hadoop3 module

the new client/hdfs3 and shaded/client-hadoop3 modules are currently a
copy of the existing client/hdfs and shaded/client modules

the addition of this will allow for changes that are available only in
hadoop3, such as
Alluxio#16017 (comment)

both client jars will be built by default, but the symlink at
client/alluxio-VERSION-client.jar will point to the hadoop-2 one to
maintain backcompat. if the hadoop-3 profile is activated by adding
`-Phadoop-3`, then the symlink will be overridden to point to the new
hadoop3 shaded client jar

pr-link: Alluxio#16699
change-id: cid-a6ffd09414e8259078fd5a8f68c3e287d85feec5
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 15, 2023
…ed yet: Add client-hadoop3 module

the new client/hdfs3 and shaded/client-hadoop3 modules are currently a
copy of the existing client/hdfs and shaded/client modules

the addition of this will allow for changes that are available only in
hadoop3, such as
Alluxio#16017 (comment)

both client jars will be built by default, but the symlink at
client/alluxio-VERSION-client.jar will point to the hadoop-2 one to
maintain backcompat. if the hadoop-3 profile is activated by adding
`-Phadoop-3`, then the symlink will be overridden to point to the new
hadoop3 shaded client jar

pr-link: Alluxio#16699
change-id: cid-a6ffd09414e8259078fd5a8f68c3e287d85feec5
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 16, 2023
…dfsFileInputStream

Implement unbuffer interface for HdfsFileInputStream. Fix Alluxio#16016.

If the unbuffer method is not implemented, then impala will not be able
to use the file handle cache.

Implement CanUnbuffer and StreamCapabilities for HdfsFileInputStream.

pr-link: Alluxio#16017
change-id: cid-b50163c7b4f199b8a61d5818a0e4739039f2745c
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 16, 2023
…dfsFileInputStream

Implement unbuffer interface for HdfsFileInputStream. Fix Alluxio#16016.

If the unbuffer method is not implemented, then impala will not be able
to use the file handle cache.

Implement CanUnbuffer and StreamCapabilities for HdfsFileInputStream.

pr-link: Alluxio#16017
change-id: cid-b50163c7b4f199b8a61d5818a0e4739039f2745c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Changes covering public API area-client Alluxio client priority-high
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unbuffer support
9 participants