
Data browser with archive appliance data source doesn't show all available data #1769

Open
MichaelRitzert opened this issue Apr 26, 2016 · 26 comments


@MichaelRitzert
Contributor

I frequently notice that the databrowser widget embedded in an OPI fails to display all archived data. It's always the time just before switching to live data where data is missing.

I never notice this when I right-click the widget and open the same .plt file in the full data browser. See the attached screenshots of roughly the same time period. The OPI has just been opened, i.e. live data covers only the very right of the display.

(screenshots: databrowser1, databrowser2)

The archiver we use is the archiver appliance.

@kasemir
Contributor

kasemir commented Apr 26, 2016

Check the data browser preference setting "Automatically refresh historic data when the live sample buffer is full":

# Automatically refresh history data when the live buffer is full.
# This will prevent the horizontal lines in the shown data when the buffer
# is too small to cover the selected time range.
org.csstudio.trends.databrowser2/automatic_history_refresh=false

The data gap, with the horizontal line between the end of the historic samples fetched from the archive and the live sample buffer, is created because at start of the data browser, it fetches archived data up to "now". Then it adds live samples to the live data buffer. That live data buffer, however, has a limited size, because otherwise you would eventually run out of memory.

When you look at the data browser widget inside a long running *.opi, that's likely to happen.
When you then start the normal data browser, it basically starts over with newly fetched archived data.

Fix 1)
Configure the live data buffer so that it holds enough samples for the width of the plot.
This can be configured for each trace.
Upside: You'll have full-resolution live data for the trace.
Downside: You need to figure out how many samples you need, and you may use more memory.

Fix 2)
Enable that automatic_history_refresh option which will re-read historic data.
Upside: Easier to configure, just turn it on, no thinking required.
Downside: Archive access is usually slow, and you tend to get lower resolution data (MDEL vs. ADEL).
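For Fix 1, the per-trace buffer size can be estimated from the plot width and the PV's update rate. A minimal sketch; this helper is hypothetical and not part of CS-Studio:

```java
// Hypothetical helper (not CS-Studio API): estimate how many samples the
// per-trace live buffer must hold to cover the full width of the plot.
public class BufferEstimate {
    /**
     * @param spanSeconds   width of the plot's time axis in seconds
     * @param updatesPerSec expected update rate of the PV
     * @return buffer size, with a ~10% safety margin so the ring buffer
     *         does not wrap just before the next refresh
     */
    public static int samplesNeeded(int spanSeconds, double updatesPerSec) {
        int base = (int) Math.ceil(spanSeconds * updatesPerSec);
        return base + base / 10;
    }

    public static void main(String[] args) {
        // A one-hour plot of a PV updating about once per second:
        System.out.println(samplesNeeded(3600, 1.0)); // prints 3960
    }
}
```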

@kasemir
Contributor

kasemir commented Apr 26, 2016

Other idea:
Increase the log level so you see the archive data requests.
Is the widget-version not requesting the full range of samples from the archive?

From the context menu, try to open the sample view to check if
a) There is indeed a gap in the samples
b) The archived data is all there, but it's not displayed??

@MichaelRitzert
Contributor Author

It's not about long-running OPIs. I can open CSS and it happens right away: e.g. CSS has been running for 1 minute, but the gap is 15 minutes. I still tried
org.csstudio.trends.databrowser2/automatic_history_refresh=true
but that didn't change anything.

I just had an occurrence of the problem in the full databrowser view. So it DOES happen, but much less frequently than in the widget. I checked, and it's missing the samples in the sample view also.

The log output says that one full hour of data is requested:
Apr 26, 2016 8:38:04 PM org.epics.archiverappliance.retrieval.client.RawDataRetrieval getDataForPV
INFO: URL to fetch data is http://archive:17668/retrieval/data/getData.raw?pv=ncount%28DESY2%3ACURR%3Acur%29&from=2016-04-26T19%3A38%3A04.936%2B02%3A00&to=2016-04-26T20%3A38%3A04.936%2B02%3A00
Apr 26, 2016 8:38:04 PM org.epics.archiverappliance.retrieval.client.RawDataRetrieval getDataForPV
INFO: URL to fetch data is http://archive:17668/retrieval/data/getData.raw?pv=DESY2%3ACURR%3Acur&from=2016-04-26T19%3A38%3A04.936%2B02%3A00&to=2016-04-26T20%3A38%3A04.936%2B02%3A00

The output from the archiver is fine. I downloaded the above URL with wget (replaced raw with csv), plotted with gnuplot, and it's perfect. Actually, I can just reopen the plot and there's a fair chance that it's fine the next time.
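To reproduce such a request outside CSS (e.g. for the wget comparison above), the query parameters just need standard percent-encoding. A standalone sketch, not the pbrawclient code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Standalone sketch (not pbrawclient): rebuild the getData.raw URL from the
// log message so it can be fetched with wget/curl for comparison.
public class RetrievalUrl {
    public static String build(String base, String pv, String from, String to) {
        // URLEncoder percent-encodes ':' as %3A, '(' as %28, '+' as %2B, etc.
        return base + "?pv=" + URLEncoder.encode(pv, StandardCharsets.UTF_8)
             + "&from=" + URLEncoder.encode(from, StandardCharsets.UTF_8)
             + "&to=" + URLEncoder.encode(to, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Reconstructs the first URL from the log output above:
        System.out.println(build(
            "http://archive:17668/retrieval/data/getData.raw",
            "ncount(DESY2:CURR:cur)",
            "2016-04-26T19:38:04.936+02:00",
            "2016-04-26T20:38:04.936+02:00"));
    }
}
```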

@kasemir
Contributor

kasemir commented Apr 26, 2016

.. it's missing the samples in the sample view also ..
That's a useful clue.

We've had problems like #1443 where the plot showed a large gap because it got confused by some specific sequence of samples that included NaN.
But in your case it seems like the data has the gap, and the plot just reflects that.

So then the problem could be

  1. http://archive:17668/retrieval/data/getData.raw doesn't always return all the data. Often, but not always.
  2. org.epics.archiverappliance.retrieval.client.RawDataRetrieval somehow drops samples.
  3. org.csstudio.trends.databrowser2.archive.ArchiveFetchJob fails to completely merge the received samples.

I'm afraid you'll need to insert log/print statements, for example in RawDataRetrieval, to see what it gets.
The ArchiveFetchJob will log "Ended ... with 2134234 samples in .. seconds" at FINE log level; maybe that also helps to see if the number of received samples suddenly gets smaller when you see the gap.

@MichaelRitzert
Contributor Author

I just ran the HTTP request (with raw output) 100 times, and it always came back with the same data (checked with sha256sum). While running those 100 requests, I did see problems in the GUI, so it's also not about the load on the server. That makes 1) unlikely.

Do you know a fast way to increase the debug level for this logger only?

To add more debug statements, I have to change my workflow to compile everything locally instead of against the online p2 repositories. I can have a look at that the day after tomorrow.

@kasemir
Contributor

kasemir commented Apr 26, 2016

If your CSS product includes the logging.ui, you should have a menu entry CSS, Debugging, Logging Configuration where you can configure individual loggers.
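Without the logging.ui, the same loggers can be raised through a java.util.logging configuration. A sketch, assuming CSS routes its logging through java.util.logging; the logger names are taken from the log output in this thread, and the handler level must also allow FINE records through:

```properties
# Raise only the archive-related loggers to FINE
org.csstudio.trends.databrowser2.archive.level=FINE
org.epics.archiverappliance.retrieval.client.level=FINE
# The console handler must also pass FINE records
java.util.logging.ConsoleHandler.level=FINE
```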

@MichaelRitzert
Contributor Author

OK, here's the relevant output. I happened to get a plot with hardly any data in it, and the output is
2016-04-27 09:43:36.863 FINE [Thread 77] org.csstudio.trends.databrowser2.archive.ArchiveFetchJob$WorkerThread (run) - Ended Read data: MARCO:TempAfterDet:TEMP:curr, 2016-04-27 08:43:36.819 +0200 - 2016-04-27 09:43:36.819 +0200 with 9 samples in 0.042 seconds
Normally, it's around 901 samples.

@MichaelRitzert
Contributor Author

And I just acquired a simultaneous tcpdump of the traffic with the archiver and log output. The request was

GET /retrieval/data/getData.raw?pv=optimized_800%28MARCO%3ATempBeforeDet%3ATEMP%3Acurr%29&from=2016-04-27T08%3A52%3A36.850%2B02%3A00&to=2016-04-27T09%3A52%3A36.850%2B02%3A00 HTTP/1.1
(which differs from the one in the log message above); the log says

2016-04-27 09:52:37.480 FINE [Thread 74] org.csstudio.trends.databrowser2.archive.ArchiveFetchJob$WorkerThread (run) - Ended Read data: MARCO:TempBeforeDet:TEMP:curr, 2016-04-27 08:52:36.850 +0200 - 2016-04-27 09:52:36.850 +0200 with 167 samples in 0.619 seconds

but the tcpdump looks like far more samples were sent. I looked at the dump with Wireshark, and without decoding the PBF payload I see several hundred instances of a very similar pattern.

@kasemir
Contributor

kasemir commented Apr 27, 2016

From your log info I'm pretty sure the problem is in the second step:

  1. http://archive:17668/retrieval/data/getData.raw doesn't always return all the data. Often, but not always.
  2. org.epics.archiverappliance.retrieval.client.RawDataRetrieval somehow drops samples.
  3. org.csstudio.trends.databrowser2.archive.ArchiveFetchJob fails to completely merge the received samples.

The ArchiveFetchJob does 'merge' data that it might receive from several archive data sources, plus it then later on ignores samples time stamped after the oldest 'live' sample.
But that count in the log message is simply the sum of all the received samples, before merging etc.

So "ArchiveFetchJob$WorkerThread (run) - Ended ...with 9 samples" should really mean that the archiverappliance.retrieval.client code only provided 9 samples. If the basic size of the network data returned by the getData.raw servlet appears to be much larger, the error must be in the archiverappliance.retrieval.client code.

I have no idea about the archive appliance, and didn't work on the archiverappliance.retrieval.client code. The network call with "pv=optimized_800.." looks like it's fetching optimized data, which in principle is good.
Maybe try configuring the channel to use original samples, and see if the problem only happens when the archiverappliance.retrieval.client code tries to decode optimized data, and not when it decodes raw data.

kasemir changed the title from "embedded databrowser widget doesn't show all available data" to "Data browser with archive appliance data source doesn't show all available data" on Apr 27, 2016
@MichaelRitzert
Contributor Author

I just took the .plt file and replaced all OPTIMIZED with RAW.
The problem still occurs.

@kasemir
Contributor

kasemir commented Apr 28, 2016

@jbobnar Jaka, do you have suggestions how to tell whether the problem is in the pbrawclient (org.epics.archiverappliance.retrieval.client) or in the org.csstudio.archive.reader.appliance code?
Looks like one of them is returning only a fraction of the samples.

@jbobnar
Contributor

jbobnar commented Apr 29, 2016

I've never seen this. I tried to reproduce it (on master branch), but it always worked OK. The fact that it always works in DataBrowser but not in OPIs is puzzling. If the problem was in the appliance code, you would probably see it in data browser at least from time to time.

Are you using a single server or a cluster with load balancer? Are all of your server instances of the same version? What about CSS version? Your previous comment (#1769 (comment)) suggests that your appliance supports "optimized" retrieval. However, in #1769 (comment) I see that CSS made a call using "ncount" operator, which suggests that your CSS might not be up to date (this was implemented about 9 months ago and is available on 4.2.x and master branches).

Can you check the log files on the server? Look for the file /retrieval/logs/localhost-2016...txt. It should contain all GET requests that the clients have made. The last number in each log entry is the number of bytes in the response of the appliance. Do you see any logs with significantly lower sizes? That would suggest that the appliance is sending back less points (only 9 samples).
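A quick way to scan such an access log for the suspect requests, assuming the format described above (last field of each entry is the response size in bytes) and using the hypothetical log path from the comment:

```shell
# Print access-log entries whose response was suspiciously small;
# the 5000-byte threshold is an assumption, tune it to your data.
awk '$NF + 0 < 5000 {print}' /retrieval/logs/localhost-2016*.txt
```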

@MichaelRitzert
Contributor Author

Please note that it also happens in the full databrowser, though less frequently.

It's a single appliance running the January version, all services on one server.

From the log, it seems all actual requests use mean_4 reduction; the requested period is 1 h. When I check that over a day, the response sizes are between 15415 and 15455 bytes (looking only at PVs that are noisy and should therefore produce a sample virtually every second). This is consistent with my 100-request wget attempt.
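As a rough cross-check (assuming mean_4 averages over 4-second bins): a one-hour window should yield about 900 samples, which matches the "around 901" a healthy fetch reports earlier in this thread, while 9 or 167 clearly does not:

```java
// Sanity check: a 1-hour window reduced with mean_4 (4-second bins)
// should produce roughly 3600 / 4 = 900 samples.
public class BinCount {
    public static int bins(int windowSeconds, int binSeconds) {
        return windowSeconds / binSeconds;
    }

    public static void main(String[] args) {
        System.out.println(bins(3600, 4)); // prints 900
    }
}
```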

CSS clients are 4.1 RCP (the live version), 4.2 RCP (just used by me to test with the latest version I could quickly build), 4.2 RAP, so indeed there might be a mix of versions producing above output. It happens just the same in all versions I tried.

Unfortunately, the system is going down at 13:30 CEST today (i.e. in less than 4 hours), so I probably won't be able to debug any more on the live system after that. I'll try to reproduce in my staging setup.

@berryma4
Member

Do you have a proxy in between? I saw something similar with a squid proxy between the archive appliance and databrowser (any client).

@MichaelRitzert
Contributor Author

No, direct access. I actually also tested from localhost.

@shroffk
Member

shroffk commented May 20, 2016

This might not be very useful information, but I have noticed the same at NSLS2, too. I have seen this even when there is no proxy or gateway in between.
The problem appears when the plot is first opened: there are segments of data missing, and any action that re-fetches the data (a refresh, moving the axis) resolves it.

I have been trying to use my local image setup of the archiver to recreate the problem, but have not had much luck reproducing it in my small test setup.

@shroffk
Member

shroffk commented May 20, 2016

P.S. I have noticed this problem with CS-Studio 4.1.x.

@shroffk
Member

shroffk commented Jun 17, 2016

Update:
The problem has not been resolved in version 4.3.
I have certain plots which, when opened, have some part of their trace missing; the part and the trace vary with each attempt, but there is always some missing data.
(screenshot)

The data is definitely present, because any re-querying resolves the issue, and fetching the data in other forms (I tested JSON) also works.

Solution:

I was able to resolve this issue by adding additional synchronization around the optimized data query and the creation of the Iterator.

        DataRetrieval dataRetrieval = reader.createDataRetriveal(reader.getDataRetrievalURL());
        synchronized(lock){
            mainStream = dataRetrieval.getDataForPV(pvName, sqlStartTimestamp, sqlEndTimestamp);
        }

This is a static lock object, so I am guessing there is a race condition on some static resource needed to properly create the Iterator, but looking at the RawDataRetrieval client and the InputStreamBackedGenMsg, I did not find any obvious problems.

However, the lock does resolve the issue and ensures that the InputStreamBackedGenMsg Iterator does not lose any data.

It would be nice if someone could verify this solution.
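To illustrate the hypothesis behind the fix: if creating the iterator touches shared, non-thread-safe state, serializing that step through one static lock keeps each reply intact. A simplified, self-contained simulation, not the actual appliance code; the shared scratch buffer below stands in for whatever static resource the real client might be racing on:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified illustration of the static-lock fix (NOT the appliance code):
// a shared, non-thread-safe resource stays consistent when every fetch is
// serialized through one static lock.
public class FetchGuard {
    private static final Object LOCK = new Object();
    // Stand-in for the shared state suspected in the real client.
    private static final StringBuilder sharedScratch = new StringBuilder();

    /** Simulated fetch: builds a reply using the shared scratch space. */
    public static String fetch(String pv, int samples) {
        synchronized (LOCK) { // whole critical section under the static lock
            sharedScratch.setLength(0);
            for (int i = 0; i < samples; i++)
                sharedScratch.append(pv).append(':').append(i).append(';');
            return sharedScratch.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        Future<String> a = pool.submit(() -> fetch("pvA", 1000));
        Future<String> b = pool.submit(() -> fetch("pvB", 1000));
        // With the lock, every reply is complete: 1000 entries each.
        System.out.println(a.get().split(";").length); // prints 1000
        System.out.println(b.get().split(";").length); // prints 1000
        pool.shutdown();
    }
}
```

Without the `synchronized` block, two concurrent fetches would interleave their writes into `sharedScratch` and each could return a truncated or mixed reply, which is the kind of silent sample loss described above.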

@kasemir
Contributor

kasemir commented Jun 17, 2016

Kunal, thanks for looking into the archive appliance code to fix this issue in there!

The data browser will read data for all channels in the plot in parallel, so implementations of org.csstudio.archive.reader.ArchiveReader must support concurrent use.

Specifically, if there are N channels, the ArchiveReaderFactory provided by the data source will be asked for N ArchiveReaders, and those will then each be called to read data for a channel.
Each ArchiveReader is only reading data for one channel at a time, but there can be N readers running in parallel.

@shroffk
Member

shroffk commented Jun 17, 2016

My lock is in the ApplianceValueIterator class, @kasemir is that alright?
Did you mean that the ArchiveReader should be thread-safe?

I also tried writing a stress test for the RawDataRetrieval client, but it works fine.

    // Assumes the pbrawclient classes (RawDataRetrieval, GenMsgIterator),
    // TimestampHelper from the appliance reader, and a pvNames collection
    // defined elsewhere in the test class. Needs java.sql.Timestamp,
    // java.time.ZonedDateTime, java.util.* and java.util.concurrent.*.
    public static class MyCall implements Callable<GenMsgIterator> {
        private final RawDataRetrieval retrieval;
        private final String pvName;
        private final Timestamp startTime;
        private final Timestamp endTime;

        protected MyCall(RawDataRetrieval retrieval, String pvName, Timestamp startTime, Timestamp endTime) {
            super();
            this.retrieval = retrieval;
            this.pvName = pvName;
            this.startTime = startTime;
            this.endTime = endTime;
        }

        @Override
        public GenMsgIterator call() throws Exception {
            return retrieval.getDataForPV(pvName, startTime, endTime);
        }
    }

    public Map<String, GenMsgIterator> multipleConcurrentFetch() {
        Timestamp start = TimestampHelper.toSQLTimestamp(ZonedDateTime.parse("2016-06-06T15:08:56.381Z").toInstant());
        Timestamp end = TimestampHelper.toSQLTimestamp(ZonedDateTime.parse("2016-06-16T16:08:56.381Z").toInstant());

        ExecutorService ex = Executors.newFixedThreadPool(5);

        Map<String, GenMsgIterator> result = new HashMap<>();
        Map<String, Future<GenMsgIterator>> calls = new HashMap<>();

        try {
            // Submit one retrieval per PV; all of them run concurrently.
            for (String pvName : pvNames) {
                RawDataRetrieval retrieval = new RawDataRetrieval("http://localhost:9300/retrieval/data/getData.raw");
                calls.put(pvName, ex.submit(new MyCall(retrieval, pvName, start, end)));
            }
            for (Entry<String, Future<GenMsgIterator>> future : calls.entrySet()) {
                result.put(future.getKey(), future.getValue().get());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

@kasemir
Contributor

kasemir commented Jun 17, 2016

Did you mean that the ArchiveReader should be thread-safe?

The ArchiveReader is an interface.
I'm saying that the implementation of the ArchiveReader needs to be thread safe, because the data browser will create multiple ArchiveReaders, one for each channel in the plot.

org.csstudio.archive.reader.appliance.ApplianceArchiveReader is the implementation for the appliance.
As far as I understand this, you found that something inside that code isn't thread safe. It only works when you add a global lock. That's OK with me. It does mean that the concurrent calls to N archive readers will effectively become serialized because of the global lock, but that doesn't matter to the data browser. It just displays the data for a channel whenever the associated reader returns.

@shroffk
Member

shroffk commented Jun 17, 2016

Sorry, I was referring to the ApplianceArchiveReader when asking if it should be thread-safe.

I have tried to keep the lock as narrow as possible.
Since in my example I can use the retrieval in a thread-safe manner, I will try and see if I can get the same behavior without needing a global lock...

berryma4 mentioned this issue Jun 20, 2016
@berryma4
Member

@shroffk your patch works for me. I should mention #1274 here. This was noticed then, and it was assumed the web proxy had something to do with it. It seems a web proxy just makes this threading issue worse. I can recreate the lost data with a web proxy, and your code snippet fixes that too.
👍

Although, looking this over, it seems this could be cleaned up a bit more. Project?

@shroffk
Member

shroffk commented Jun 20, 2016

I think my synchronization addresses a symptom rather than the actual problem... but at least we don't have missing data.

@jbobnar
Contributor

jbobnar commented Jun 21, 2016

@shroffk
I checked the pbrawclient.jar code, but also didn't see anything that could cause synchronization issues. It could also be a server problem.
In the test you could also try one of the postprocessors instead of the plain PV name, for example optimized_1000(pvName) or mean_100(pvName).

@shroffk
Member

shroffk commented Jun 21, 2016

I am testing with postprocessors...the test I ran was with optimized_800()
