
Split >5GB Files automatically #21

Closed · wants to merge 19 commits

Conversation

@djalova (Contributor) commented Mar 31, 2016

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

@djalova (Contributor, Author) commented Mar 31, 2016

We cannot guarantee that a single partition will always be under Swift's 5 GB object size limit. When a partition reaches that limit, this patch splits the file itself: it opens a new stream and writes the remaining contents to another object named "FileName/part-xxxx-split-xxxx-attempt..."

@gilv Can you review this?
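For readers following along, here is a minimal sketch of the rollover idea described above. It is not the patch code: the class, the helper names, and the exact position of the split index in the object name are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch: once the current object reaches the size limit, close its stream and
// continue writing into a new object whose name carries a "split" index.
public class SplittingOutputStream extends OutputStream {

  private final long maxObjectSize;   // e.g. 5 GB by default
  private final String baseName;      // e.g. "FileName/part-xxxx-attempt..."
  private long bytesInCurrentPart = 0;
  private int splitIndex = 0;
  private OutputStream current;

  public SplittingOutputStream(String baseName, long maxObjectSize) throws IOException {
    this.baseName = baseName;
    this.maxObjectSize = maxObjectSize;
    this.current = openSwiftObject(baseName);   // hypothetical helper
  }

  @Override
  public void write(int b) throws IOException {
    if (bytesInCurrentPart >= maxObjectSize) {
      current.close();                          // finish the current object
      splitIndex++;
      current = openSwiftObject(splitName(splitIndex));
      bytesInCurrentPart = 0;
    }
    current.write(b);
    bytesInCurrentPart++;
  }

  @Override
  public void close() throws IOException {
    current.close();
  }

  // In the actual patch the split index sits before the attempt suffix,
  // i.e. "FileName/part-xxxx-split-xxxx-attempt...". This is simplified.
  private String splitName(int idx) {
    return String.format("%s-split-%05d", baseName, idx);
  }

  private static OutputStream openSwiftObject(String name) throws IOException {
    // Placeholder: in Stocator this would be a streaming PUT to Swift.
    throw new UnsupportedOperationException("not part of this sketch: " + name);
  }
}
```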

@gilv (Contributor) commented Apr 1, 2016

@djalova Sure, I will check it. Thanks. It's an interesting idea to split files this way.

@fraPace (Contributor) commented Apr 1, 2016

@djalova
I believe hard-coding the limit is not the right approach.
The default limit in Swift is indeed 5 GB, but that might not always be the case. The Swift admin might have chosen a lower limit for some reason, and your code would then fail.

The correct way, in my opinion, would be to query the Swift cluster capabilities when the Stocator instance is created (http://developer.openstack.org/api-ref-objectstorage-v1.html#infoDiscoverability) and thus learn the maximum size allowed per object.

@gilv What do you think?
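For reference, a capability query like the one @fraPace mentions could look roughly like the sketch below. The /info endpoint is the one linked above; the swift.max_file_size field is what a stock Swift deployment reports, but the parsing and error handling here are assumptions, not Stocator code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.json.JSONObject;

public class SwiftCapabilities {

  // Ask the cluster for its capabilities and return the advertised maximum
  // object size, falling back to a default when /info is not available.
  public static long queryMaxObjectSize(String swiftBaseUrl, long fallback) {
    try {
      URL url = new URL(swiftBaseUrl + "/info");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("GET");
      StringBuilder body = new StringBuilder();
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
          body.append(line);
        }
      }
      // A stock response contains {"swift": {"max_file_size": 5368709120, ...}}
      JSONObject info = new JSONObject(body.toString());
      return info.getJSONObject("swift").getLong("max_file_size");
    } catch (Exception e) {
      // Discoverability disabled, unreachable endpoint, or unexpected payload
      return fallback;
    }
  }
}
```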

@gilv (Contributor) commented Apr 1, 2016

@Nosfe @djalova I agree, but I think we should also make it configurable via configuration.
More importantly, we should decide whether to use the approach suggested by @djalova or the standard approach of an SLO with a manifest. I have to admit, my first impression is that @djalova's approach looks much better than using a manifest for SLO or DLO.

@fraPace (Contributor) commented Apr 4, 2016

@gilv @djalova
I agree with you, but I would go a little further. The logical flow I propose is the following:

  1. Stocator sets a default value of 5 GB.
  2. Stocator queries the maximum object size allowed by the Swift cluster that will be used.
    2.1) If the query returns a result, it overrides the default value.
  3. Stocator looks for the configurable parameter.
    3.1) If it exists, check that it is not greater than the previous value (if it is, report an error), and override the previous value.

This way, the possibility of errors due to misconfiguration is almost eliminated (see the sketch below).

What do you think?
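A minimal sketch of that precedence follows. The property key fs.swift.max.object.size and the method shape are assumptions for illustration; the real patch may use a different key.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class MaxObjectSizeResolver {

  // Step 1: built-in default of 5 GB.
  static final long DEFAULT_MAX_OBJECT_SIZE = 5L * 1024 * 1024 * 1024;

  // clusterLimitOrZero is the value reported by the Swift cluster,
  // or 0 if the capability query returned nothing.
  public static long resolve(Configuration conf, long clusterLimitOrZero) throws IOException {
    long limit = DEFAULT_MAX_OBJECT_SIZE;
    if (clusterLimitOrZero > 0) {
      limit = clusterLimitOrZero;               // step 2: cluster capability overrides default
    }
    long configured = conf.getLong("fs.swift.max.object.size", 0);  // step 3: user setting
    if (configured > 0) {
      if (configured > limit) {
        // step 3.1: the configured value must not exceed the allowed limit
        throw new IOException("Configured max object size " + configured
            + " exceeds the allowed limit " + limit);
      }
      limit = configured;
    }
    return limit;
  }
}
```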

@djalova (Contributor, Author) commented Apr 4, 2016

@Nosfe @gilv
I agree with your points. I already added a commit to make the object size configurable. I just added a commit to check if the value from the configuration is valid.

@gilv (Contributor) commented Apr 4, 2016

@djalova I think it's a very good patch :) Much better than my original idea of using SLO and a manifest.
I haven't had time to test it yet. Did you by any chance test it with a real object where a single part is more than 5 GB?

@djalova (Contributor, Author) commented Apr 4, 2016

@gilv Yes, @jasoncl and I have tested writing single files of 6 GB and 10 GB with a max object size of 5 GB. We've also tested writing 100 MB files with a 10 MB max size. In all cases we read back from the split files, and the reads work without any loss of data.

@djalova (Contributor, Author) commented Apr 7, 2016

@gilv Have you had time to test this out yet? I think I'm done with the changes I want to add. Let me know if you think there needs to be more done to get this merged. Thanks.

@gilv (Contributor) commented Apr 9, 2016

@djalova It looks good. I just need to perform a couple of additional tests and haven't had time so far. I'll try to do it during the next week.

}

@Override
public void close() throws IOException {
LOG.info("{} bytes written", totalBytesWritten);
@gilv (Contributor) commented on this line:
@djalova I guess we don't want this at info level, otherwise it will be printed all the time. Can you remove it, please? I've recently been trying to reduce the number of debug prints. If you like, you can log it at trace level.

@djalova (Contributor, Author) replied:
Sure, that makes sense.
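For reference, the agreed change amounts to something like the following (a sketch, not the exact diff):

```java
// Log the byte count at trace level only, so it does not show up in normal runs.
if (LOG.isTraceEnabled()) {
  LOG.trace("{} bytes written", totalBytesWritten);
}
```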

@gilv (Contributor) commented Apr 15, 2016

@djalova It looks good; I just reviewed the code and left one comment. I will also try to run the code later. Can you please add a unit test or a functional test? It's not mandatory, but it would be good to have.

@gilv (Contributor) commented Apr 15, 2016

@djalova Can you please rebase this branch? It seems to have conflicts with the master branch.

@djalova (Contributor, Author) commented Apr 15, 2016

@gilv I rebased and added your suggestion. I'll work on the test and push when it's done.

@gilv (Contributor) commented Apr 17, 2016

@djalova I think it's very good, but we need to resolve the resiliency issues that this patch adds.

As an example, assume SF311.csv/part-00000-attempt_201604171048_0000_m_000000_0 is written. If the task fails there will be additional attempts, like
SF311.csv/part-00000-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-attempt_201604171048_0000_m_000000_2
SF311.csv/part-00000-attempt_201604171048_0000_m_000000_3
etc.
Assume that the job completed successfully.

The list() method will pick up the correct part-00000 based on size (the largest size is the winner). For example, it may choose SF311.csv/part-00000-attempt_201604171048_0000_m_000000_2 and ignore attempts 0 and 3. The resolution relies on the fact that all attempts share the same object name "part-00000".

Adding this patch will affect the way list() works, since it modifies part-ID to part-ID-split-ID.
For example, if the task fails after writing split-00008:

SF311.csv/part-00000-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00001-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00002-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00003-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00004-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00005-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00006-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00007-attempt_201604171048_0000_m_000000_0
SF311.csv/part-00000-split-00008-attempt_201604171048_0000_m_000000_0

and there will be a replacement task "1":

SF311.csv/part-00000-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00001-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00002-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00003-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00004-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00005-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00006-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00007-attempt_201604171048_0000_m_000000_1
SF311.csv/part-00000-split-00008-attempt_201604171048_0000_m_000000_1

then the current code will fail to identify the correct attempt, since part-00000 has been modified to part-00000-split-NUMBER.

I think to resolve this we just need to modify the list() method so it keys on "part-NUMBER" and not "part-NUMBER-split-NUMBER".
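A hedged sketch of that adaptation: group candidates by the part ID with the "-split-NNNNN" piece removed, then keep only the objects of the winning attempt. The helper names and the regex are illustrative, not the actual Stocator list() code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class AttemptResolutionSketch {

  private static final Pattern SPLIT_SUFFIX = Pattern.compile("-split-\\d+");

  // "SF311.csv/part-00000-split-00003-attempt_..._0" -> "SF311.csv/part-00000"
  static String partKey(String objectName) {
    String withoutAttempt = objectName.substring(0, objectName.lastIndexOf("-attempt"));
    return SPLIT_SUFFIX.matcher(withoutAttempt).replaceAll("");
  }

  // "..._m_000000_1" -> "1"
  static String attemptId(String objectName) {
    return objectName.substring(objectName.lastIndexOf('_') + 1);
  }

  // Per part, pick the attempt whose splits add up to the largest total size,
  // mirroring the "largest size wins" rule used for unsplit parts.
  static Map<String, String> winningAttemptPerPart(Map<String, Long> objectSizes) {
    Map<String, Map<String, Long>> totals = new HashMap<>();
    for (Map.Entry<String, Long> e : objectSizes.entrySet()) {
      totals.computeIfAbsent(partKey(e.getKey()), k -> new HashMap<>())
            .merge(attemptId(e.getKey()), e.getValue(), Long::sum);
    }
    Map<String, String> winners = new HashMap<>();
    totals.forEach((part, byAttempt) -> winners.put(part,
        Collections.max(byAttempt.entrySet(), Map.Entry.comparingByValue()).getKey()));
    return winners;
  }
}
```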

@djalova (Contributor, Author) commented Apr 18, 2016

@gilv Are you referring to the String returned by nameWithoutTaskID()? I added some debug messages, and the returned name already includes the "split-xxxxx" part without any extra code.

@gilv (Contributor) commented Apr 19, 2016

@djalova I think that is exactly the issue here: the name should not contain "split".

@gilv (Contributor) commented Apr 19, 2016

@djalova The algorithm in the list() method should be adapted as I described in my previous remark; otherwise it will not work. I can try to adapt it.

@djalova (Contributor, Author) commented Apr 19, 2016

@gilv Is this so that if any of the split uploads fails, we start over and look for a part-00000 with a different attempt number? I assumed that since the listing is alphabetical this wouldn't be a problem. For example, with two attempts A and B the listing would be:
part-0000-attemptA
part-0000-attemptB
part-0000-split-0001-attemptA
part-0000-split-0001-attemptB
Then if "part-0000-split-0001-attemptA" fails, we will catch it when it is compared to "part-0000-split-0001-attemptB" for collisions.

@djalova (Contributor, Author) commented May 18, 2016

@gilv I don't think there's an issue with the listing, because the split number is included when it checks for collisions. Also, since we go through the list alphabetically, the collision check compares objects with the same part and split number when there is a failed attempt.

@gilv (Contributor) commented May 20, 2016

@djalova Split logic is internal, and the Spark task is not aware of it. Here is an example:
Task 1 writes data that is split internally into part1-attempt-1-split-1, part1-attempt-1-split-2, part1-attempt-1-split-3.
Consider a replacement task that generates part1-attempt-2-split-1, part1-attempt-2-split-2, part1-attempt-2-split-3.

The list algorithm will not work in this case.

@djalova (Contributor, Author) commented May 20, 2016

@gilv The naming scheme is part-#-split-#-attempt#. In the list logic, everything after the last '-' is stripped, so when objects are compared we compare part-#-split-#. Do we want to make sure that we grab parts and splits from the same attempt, or do we just care about the part and split number?
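For context, the stripping described here amounts to something like this (illustrative, not the exact code in the patch):

```java
// Drop everything after the last '-' (the attempt suffix), so
// "part-00000-split-00003-attempt_..._0" compares as "part-00000-split-00003"
// and "part-00000-attempt_..._0" compares as "part-00000".
static String nameWithoutAttempt(String objectName) {
  int idx = objectName.lastIndexOf('-');
  return idx >= 0 ? objectName.substring(0, idx) : objectName;
}
```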

@khanderao commented:

Hello @djalova: when is this fix planned to be merged?

@djalova (Contributor, Author) commented Aug 17, 2016

Hi @khanderao,
The code in master has changed a bit since I opened this PR. I'll have this updated by tomorrow.

@gilv (Contributor) commented Aug 17, 2016

@khanderao Do you have a use case where a single Spark task writes more than 4 GB of data?

@djalova closed this Aug 29, 2018