Break up large read requests into smaller, pipelined requests. #20125

Merged
merged 3 commits into cms-sw:CMSSW_9_2_X on Aug 12, 2017

Conversation

bbockelm
Contributor

This breaks up any read request over 8MB into a series of reads that are 8MB or smaller. To avoid network-latency-induced stalls, we pipeline two requests at a time.

The intent is to prevent large read requests (such as the 128MB ones used by lazy-download) from hitting the per-operation timeout.
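For illustration, here is a minimal sketch of the approach in isolation. It is not the actual Utilities/XrdAdaptor code; the 8MB constant, the issueRead primitive, and all names are placeholder assumptions. The request is split into chunks of at most 8MB, and two chunk reads are kept outstanding so the latency of one overlaps the transfer of the next.

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <future>

// Placeholder for the 8MB cap on a single sub-read.
constexpr std::size_t kMaxChunk = 8 * 1024 * 1024;

// Stand-in for an asynchronous read primitive: start reading `size` bytes at
// file offset `off` into `buf`, return a future holding the bytes actually read.
std::future<std::size_t> issueRead(char *buf, std::size_t size, std::int64_t /*off*/) {
  return std::async(std::launch::async, [=] {
    std::memset(buf, 0, size);  // pretend the network read filled the buffer
    return size;
  });
}

// Split a read of `n` bytes at offset `pos` into chunks of at most 8MB,
// keeping two chunk requests in flight at a time.
std::size_t chunkedRead(char *into, std::size_t n, std::int64_t pos) {
  std::future<std::size_t> prev;  // chunk issued on the previous iteration, if any
  std::size_t total = 0;
  for (std::size_t off = 0; off < n;) {
    std::size_t chunk = std::min(kMaxChunk, n - off);
    auto cur = issueRead(into + off, chunk, pos + off);  // issue the next chunk first
    if (prev.valid())
      total += prev.get();  // then wait for the prior chunk; get() rethrows on failure
    prev = std::move(cur);
    off += chunk;
  }
  if (prev.valid())
    total += prev.get();  // drain the last outstanding chunk
  return total;
}

In the PR itself the chunks are built as IOPosBuffer requests and handed to the XrdAdaptor request manager (see the review snippets below); the sketch above only illustrates the control flow.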

@cmsbuild
Contributor

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_9_2_X.

It involves the following packages:

Utilities/XrdAdaptor

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit this is something you requested to watch as well.
@davidlange6 you are the release manager for this.

cms-bot commands are listed here

@davidlange6
Contributor

hi @bbockelm - please make a master branch request too.

@davidlange6
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Aug 11, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/22210/console Started: 2017/08/11 08:02

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20125/22210/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 22
  • DQMHistoTests: Total histograms compared: 1791740
  • DQMHistoTests: Total failures: 44267
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1747307
  • DQMHistoTests: Total skipped: 166
  • DQMHistoTests: Total Missing objects: 0
  • Checked 90 log files, 14 edm output root files, 22 DQM output files

@slava77
Contributor

slava77 commented Aug 11, 2017

urgent

T0 wanted this in the release, as mentioned in the OPS meeting today
@drkovalskyi

@Dr15Jones
Contributor

Is there any way to test that this change actually helps with the problem in the T0?

// In some cases, the IO layers above us (particularly, if lazy-download is
// enabled) will emit very large reads. We break this up into multiple
// reads in order to avoid hitting timeouts.
std::vector<IOPosBuffer> requests;
Contributor

You can also do a reserve call here by using the integer division of n and XRD_CL_MAX_READ_SIZE
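A rough sketch of the suggestion, assuming n is the total request size in bytes and XRD_CL_MAX_READ_SIZE is the per-chunk limit used for the split (the +1 covers a trailing partial chunk):

std::vector<IOPosBuffer> requests;
requests.reserve(n / XRD_CL_MAX_READ_SIZE + 1);  // upper bound on the number of chunks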


uint32_t bytesRead = m_requestmanager->handle(into, n, pos).get();
std::vector<std::pair<std::future<IOSize>, IOSize>> futures; futures.reserve(requests.size());
Contributor

Please use two lines
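That is, with the declaration and the reserve call each on their own line:

std::vector<std::pair<std::future<IOSize>, IOSize>> futures;
futures.reserve(requests.size());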

bool readReturnedShort = false;
for (auto &future : futures) {
// Future throws an exception on failure.
IOSize result = future.first.get();
Contributor

Why is this a second loop and not just done at line 281? In fact, I don't think there is a need for the futures container at all. Looks like you just need a handle on two of them, a present and a next.

Contributor Author

bbockelm commented Aug 11, 2017


I thought this would help the readability of the code -- indeed, it can be collapsed together (and only two futures are necessary).

Contributor Author

Ok, I think I came up with a clean / readable way to do this without maintaining a list. Will update in a moment.

@sextonkennedy
Member

@Dr15Jones the testing will have to be done in a replay of the tier0 with this release. Dirk and Brian have been debating the strategy of this in the ticket and the need to reduce the read size was also requested by a CERN IT developer. I appreciate your C++ review and comments, but once Brian addresses them I think we have to move forward. The tier0 is paused right now awaiting the new release.

@cmsbuild
Contributor

Pull request #20125 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Aug 11, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/22226/console Started: 2017/08/11 16:41

cur_future_expected = chunk;

// Wait for the prior read; update bytesRead.
check_read(prev_future, prev_future_expected);
Contributor

So this works the first time through the loop because prev_future.valid() is false and check_read drops out immediately?

Contributor Author

Correct.
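For context, a minimal sketch of a helper with that behaviour; IOSize, bytesRead, and readReturnedShort are taken from the earlier snippets, and the body is assumed rather than copied from the PR:

// A default-constructed std::future has valid() == false, so the first call,
// before any chunk has been issued, returns immediately.
auto check_read = [&](std::future<IOSize> &future, IOSize expected) {
  if (!future.valid())
    return;                      // nothing outstanding yet
  IOSize result = future.get();  // rethrows if that sub-read failed
  if (result != expected)
    readReturnedShort = true;    // short read; do not expect the remaining data
  bytesRead += result;
};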

@Dr15Jones
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_9_2_X IBs after it passes the integration tests and once validation in the development release cycle CMSSW_9_3_X is complete. This pull request will now be reviewed by the release team before it's merged. @davidlange6, @smuzaffar (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20125/22226/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 22
  • DQMHistoTests: Total histograms compared: 1792872
  • DQMHistoTests: Total failures: 29342
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1763364
  • DQMHistoTests: Total skipped: 166
  • DQMHistoTests: Total Missing objects: 0
  • Checked 90 log files, 14 edm output root files, 22 DQM output files

@sextonkennedy
Member

I want to share with this thread the notes from the meeting with the storage operations team that occurred last Mon., which will be repeated next Mon. Issue no. 2 is probably the scariest, as there is no way to recover RAW data that gets lost once P5 has already cleaned up. This is why I'm willing to push for the shortcut of testing only in a replay. The source of the large read requests is lazy-download, and only the tier0 should still be using that (RAL was recently convinced to stop doing so).

Hi all,

here are the minutes from yesterday's meeting; thanks to Jan for putting down the notes!
Please let us know if we missed/forgot something

Cheers,
Luca

============================================================

EOSCMS crisis meeting 2017-08-09, triggered by GGUS#129607 (alarm)

present: Christoph (remote), Hervé, Elvin, Dima, Zeynep, John, Luca, Jan

CMS sees way too many issues with EOSCMS, manpower-intensive to the point that they are considering alternatives (run outside of CERN, stream initially to CASTOR, ..).
CMS load should be comparable to last year (but with bigger files).
P5 is severely limited on storage and needs to delete files soon after EOS has acknowledged successful transfer.

Main issues, in decreasing priority:

  1. "disappearing files": files get written (OK), checked (OK), then go
    away. affect raw data files.
  2. "0-size files": files get written(OK), checked (OK), then namespace
    shows them to be truncated
  3. "disappearing directories": (recent, discovered by accident -
    potentially huge impact?)
  4. "Machine not on the network": annoying background error rate

CMS has implemented various workarounds (disable client-side write recovery, "eoscp -x"), but still sees errors that were supposed to be fixed.


  1. "disappearing files"

Bug in client timeout+retry logic - a first attempt got stalled, a retry
went OK, eventually the stalled attempt got treated, caused an error,
and cleanup-on-error removed the file.

EOS ops/devs believe this should no longer occur after the EOSCMS MGM update of 2017-08-07 19:00 (0.3.265: a workaround is in place that no longer "cleans up" in case replicas exist - this is a server-side workaround).
Anything after that time needs investigation - please report.

CMS T0 jobs should "forget" about these files after 12h, so once this is fixed we would expect a quick drop in error messages. However, external transfers might still want to access these files much later (and get errors).

  1. "0-size files"

Also assumed to be due to a client-side Xrootd internal retry (where a second attempt just "truncates" the file). The actual file content is still on EOS, but the size mismatch causes the files to no longer be readable (which partially contributes to the "Machine not on the network" errors).

EOS ops/devs believe(d) this should no longer occur when
"XRD_WRITERECOVERY=0" is set. This has been done for T0 "agent" (date?),
and for StorageManager (2017-08-08)

CMS has still seen this afterwards - will give fresh examples.

These files need to be "recovered" as much as possible (raw data, no
longer at P5) by EOS ops.

  1. "disappearing directories"

At least one known bug that gets triggered during namespace "compaction"
(which happened 2017-08-07 ~18:00). Directories can be manually
recovered if missing, and a cold restart of the namespace should bring
all of them back.
-> Next update should include such a cold restart. CMS OK with the
associated downtime (<1h).

  1. "Machine not on the network"

Ongoing investigation, possibly linked to the use of 128MB prefetch, but seen by 3 different job classes. Lower priority. No recent CMS-side changes.

  1. "response mixup" (servers answers for a different file than request)

Possible connection to a known xrootd client bug that gets triggered under load; the recommendation was to update to 4.6.
CMS challenges this (the old client was in use for a year, but the errors are recent; client-side load (streams, jobs, CPU) is unchanged) but will update tomorrow.

Any pattern? Seems to only affect transfers from "glidein"?
CMS explains that 90% of transfers would have this (based on CPU allocations); the rest is PhEDEx (which uses GridFTP, which internally speaks Xrootd).

@slava77
Contributor

slava77 commented Aug 12, 2017

merge

following the request/confirmation from @sextonkennedy

@slava77
Contributor

slava77 commented Aug 12, 2017

I guess my magic power is gone

@smuzaffar
Contributor

@slava77, I have added you as a special release manager. cms-bot should recognize your merge request.
cms-sw/cms-bot@51357c8#diff-61ad223fc9fb3b45ce3e5cdac2916918

@cmsbuild merged commit 69bf76d into cms-sw:CMSSW_9_2_X on Aug 12, 2017
@bbockelm
Contributor Author

@sextonkennedy - to be clear, I believe this will significantly address many known causes of [4].

[5] may be a bug in the upstream xrootd client (although it's hard to determine if the server is getting confused in responding or if the client is getting confused in parsing the response). [1 - 3] are likely EOS-specific issues.

@davidlange6
Contributor

So the reason to push this in at high priority won't be fixed by the PR... interesting.

@bbockelm
Contributor Author

Well, it will fix the ALARM ticket (https://ggus.eu/index.php?mode=ticket_info&ticket_id=129607). I just don't see how it would fix the server-side issues.

@sextonkennedy
Member

sextonkennedy commented Aug 12, 2017 via email

@bbockelm
Contributor Author

Hi Liz,

I didn't do too deep an analysis here; [2] could certainly be in the same category as [5]. The EOS folks have probably investigated deeper on this issue than I have. Regardless, it wouldn't be affected by this PR.

Is there a GGUS ticket for this? If we want to dig deeper from the client side, it'd probably be good to catch me up on the issue (and, additionally, not to analyze a separate bug on a closed PR...).

Brian
