Data browser with archive appliance data source doesn't show all available data #1769
Check the data browser preference setting "Automatically refresh historic data when the live sample buffer is full".
The data gap with the horizontal line between the end of the historic samples fetched from the archive and the live sample buffer is created because at startup the data browser fetches archived data up to "now", and then adds live samples to the live sample buffer. That live buffer, however, has a limited size, because otherwise you would eventually run out of memory. When you look at a data browser widget inside a long-running *.opi, that's likely to happen. Fix 1) Fix 2)
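To illustrate the limited-size live buffer described above, here is a minimal sketch (illustrative names, not the actual data browser classes): once the buffer is full, each new live sample pushes out the oldest one, so the span between the last archived sample and the oldest remaining live sample keeps growing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sketch of a bounded live-sample buffer. Illustrative only. */
class LiveSampleBuffer {
    private final int capacity;
    private final Deque<Double> samples = new ArrayDeque<>();

    LiveSampleBuffer(int capacity) {
        this.capacity = capacity;
    }

    void add(double sample) {
        if (samples.size() >= capacity)
            samples.removeFirst(); // oldest live sample is lost -> the gap grows
        samples.addLast(sample);
    }

    int size() {
        return samples.size();
    }

    double oldest() {
        return samples.peekFirst();
    }
}
```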
Other idea: From the context menu, try to open the sample view to check if …
It's not in long-running OPIs. I can open CSS and it happens right away, e.g. CSS has been running for 1 minute, but the gap is 15 minutes. I still tried. I just had an occurrence of the problem in the full data browser view, so it DOES happen, but much less frequently than in the widget. I checked, and the samples are missing in the sample view as well. The log output says that one full hour of data is requested. The output from the archiver is fine: I downloaded the above URL with wget (replaced raw with csv), plotted with gnuplot, and it's perfect. Actually, I can just reopen the plot and there's a fair chance that it's fine the next time.
We've had problems like #1443 where the plot showed a large gap because it got confused by some specific sequence of samples that included NaN. So then the problem could be
I'm afraid you'll need to insert log/print statements, for example in RawDataRetrieval, to see what it gets.
I just ran the HTTP request (with raw output) 100 times, and it always came back with the same data (checked with sha256sum). While taking the 100 samples, I did see problems in the GUI, so it's also not about the load on the server. That makes 1) unlikely. Do you know a fast way to increase the debug level for this logger only? To add more debug statements, I have to change my workflow to compile everything locally instead of against the online p2s. I can have a look into that the day after tomorrow.
If your CSS product includes the logging.ui, you should have a menu entry CSS, Debugging, Logging Configuration where you can configure individual loggers. |
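If the logging.ui dialog is not available in a given product, a per-logger level can also be set programmatically through java.util.logging. This is a generic sketch; the logger name below is taken from the FINE output quoted elsewhere in this thread, and whether it matches your exact build is an assumption.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: raise the level of one named logger without touching the rest. */
class LoggerConfig {
    static Logger enableFine(String name) {
        Logger logger = Logger.getLogger(name);
        // Affects only this logger (and its children); other loggers keep
        // whatever level the logging configuration gave them.
        logger.setLevel(Level.FINE);
        return logger;
    }
}
```

Note that for FINE messages to actually appear, the handlers must pass them through as well; the default ConsoleHandler of java.util.logging is at INFO and would still swallow FINE records.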
OK, here's the relevant output. I happened to get a plot with hardly any data in it, and the output is:
And I just acquired a simultaneous tcpdump of the traffic with the archiver and log output. The request was

```
GET /retrieval/data/getData.raw?pv=optimized_800%28MARCO%3ATempBeforeDet%3ATEMP%3Acurr%29&from=2016-04-27T08%3A52%3A36.850%2B02%3A00&to=2016-04-27T09%3A52%3A36.850%2B02%3A00 HTTP/1.1
```

and the log shows

```
2016-04-27 09:52:37.480 FINE [Thread 74] org.csstudio.trends.databrowser2.archive.ArchiveFetchJob$WorkerThread (run) - Ended Read data: MARCO:TempBeforeDet:TEMP:curr, 2016-04-27 08:52:36.850 +0200 - 2016-04-27 09:52:36.850 +0200 with 167 samples in 0.619 seconds
```

but the tcpdump looks like far more samples were sent. Looking at the dump with wireshark, without decoding the PBF, I see several hundred instances of a very similar pattern.
From your log info I'm pretty sure the problem is in the second step:
The ArchiveFetchJob does 'merge' data that it might receive from several archive data sources, and later on it ignores samples time-stamped after the oldest 'live' sample. So "ArchiveFetchJob$WorkerThread (run) - Ended ... with 9 samples" should really mean that the archiverappliance.retrieval.client code only provided 9 samples. If the raw size of the network data returned by the getData.raw servlet appears to be much larger, the error must be in the archiverappliance.retrieval.client code. I have no idea about the archive appliance, and didn't work on the archiverappliance.retrieval.client code. The network call with "pv=optimized_800..." looks like it's fetching optimized data, which in principle is good.
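A rough sketch of those two post-processing steps, merging by time stamp and then dropping samples at or after the oldest live sample. This is illustrative code, not the actual ArchiveFetchJob implementation; time stamps are simplified to long values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Sketch of merging archived samples from several sources, then trimming
 *  everything at or after the oldest live sample. Illustrative names only. */
class ArchiveMerge {
    static List<Long> mergeAndTrim(List<List<Long>> sources, long oldestLiveStamp) {
        // TreeMap keeps samples ordered by time stamp; on identical stamps,
        // the later source wins.
        TreeMap<Long, Long> merged = new TreeMap<>();
        for (List<Long> source : sources)
            for (Long stamp : source)
                merged.put(stamp, stamp);
        // Keep only samples strictly before the oldest live sample,
        // since live data covers everything from that point onward.
        return new ArrayList<>(merged.headMap(oldestLiveStamp).values());
    }
}
```

Whatever reaches this stage is all the plot can show, which is why a low "Ended ... with N samples" count points at the retrieval client rather than at this merge step.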
I just took the .plot file and replaced all OPTIMIZED with RAW. |
@jbobnar Jaka, do you have suggestions how to tell if the problem is in the pbrawclient (org.epics.archiverappliance.retrieval.client) or the org.csstudio.archive.reader.appliance code?
I've never seen this. I tried to reproduce it (on the master branch), but it always worked OK. The fact that it always works in the DataBrowser but not in OPIs is puzzling. If the problem was in the appliance code, you would probably see it in the data browser at least from time to time.

Are you using a single server or a cluster with a load balancer? Are all of your server instances the same version? What about the CSS version? Your previous comment (#1769 (comment)) suggests that your appliance supports "optimized" retrieval. However, in #1769 (comment) I see that CSS made a call using the "ncount" operator, which suggests that your CSS might not be up to date (this was implemented about 9 months ago and is available on the 4.2.x and master branches).

Can you check the log files on the server? Look for the file /retrieval/logs/localhost-2016...txt. It should contain all GET requests that the clients have made. The last number in each log entry is the number of bytes in the response of the appliance. Do you see any logs with significantly lower sizes? That would suggest that the appliance is sending back fewer points (only 9 samples).
Please note that it does also happen in the data browser, though less frequently. It's a single appliance running the January version, with all services on one server.

From the log, it seems all actual requests are using mean_4 reduction; the requested period is 1h. When I check for that over a day, the sizes are between 15415 and 15455 (looking only at PVs that are noisy and so should produce a sample virtually every second). This is consistent with my 100-request wget attempt.

The CSS clients are 4.1 RCP (the live version), 4.2 RCP (just used by me to test with the latest version I could quickly build), and 4.2 RAP, so indeed there might be a mix of versions producing the above output. It happens just the same in all versions I tried.

Unfortunately, the system is going down at 13:30 CEST today (i.e. in less than 4 hours), so I probably won't be able to debug any more on the live system after that. I'll try to reproduce in my staging setup.
Do you have a proxy in between? I saw something similar with a squid proxy between the archive appliance and databrowser (any client). |
No, direct access. I actually also tested from localhost. |
This might really not be very useful information, but I have noticed the same at NSLS2 too. I have seen this even when there is no proxy or gateway in between. I have been trying to use my local image setup of the archiver to recreate the problem but not had much luck with reproducing it in my small test setup. |
P.S. I have noticed this problem with cs-studio 4.1.x
Kunal, thanks for looking into the archive appliance code to fix this issue in there! The data browser will read data for all channels in the plot in parallel, so implementations of org.csstudio.archive.reader.ArchiveReader must be multithreaded. Specifically, if there are N channels, the ArchiveReaderFactory provided by the data source will be asked for N ArchiveReaders, and those will then each be called to read data for a channel. |
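The threading pattern described above could be sketched like this (made-up names and a toy Reader interface standing in for org.csstudio.archive.reader.ArchiveReader; the point is that one reader is created per channel and all are used concurrently, so any state they share must be thread-safe):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/** Sketch of the data browser's per-channel parallel fetch. Illustrative only. */
class ParallelFetchSketch {
    /** Toy stand-in for an archive reader. */
    interface Reader {
        List<Integer> read(String channel);
    }

    /** Ask the factory for one reader per channel, then read all channels in parallel. */
    static List<List<Integer>> fetchAll(List<String> channels, Supplier<Reader> factory)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(channels.size());
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (String channel : channels) {
            Reader reader = factory.get();               // N channels -> N readers
            Callable<List<Integer>> task = () -> reader.read(channel);
            futures.add(pool.submit(task));              // each fetch on its own thread
        }
        List<List<Integer>> results = new ArrayList<>();
        for (Future<List<Integer>> future : futures)
            results.add(future.get());                   // wait for every fetch
        pool.shutdown();
        return results;
    }
}
```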
My lock is in the ApplianceValueIterator class, @kasemir, is that alright? I also tried writing a stress test for the RawDataRetrieval client, but it works fine.
The ArchiveReader is an interface. org.csstudio.archive.reader.appliance.ApplianceArchiveReader is the implementation for the appliance.
Sorry, I was referring to the ApplianceArchiveReader when asking if it should be thread-safe. I have tried to keep the lock as narrow as possible.
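For illustration, a narrow lock around just the shared-stream access might look like the following sketch (hypothetical names, not the actual ApplianceValueIterator code): only the fetch from the shared stream is synchronized, while decoding happens outside the critical section.

```java
import java.util.Iterator;

/** Sketch of a narrow critical section around a shared sample stream. */
class NarrowLockIterator {
    private final Object lock = new Object();
    private final Iterator<Integer> sharedStream;

    NarrowLockIterator(Iterator<Integer> sharedStream) {
        this.sharedStream = sharedStream;
    }

    Integer next() {
        Integer raw;
        synchronized (lock) {
            // Narrow critical section: only pull the next raw sample.
            raw = sharedStream.hasNext() ? sharedStream.next() : null;
        }
        // Decoding/processing stays outside the lock (here: a toy transform).
        return raw == null ? null : raw * 2;
    }
}
```

Keeping the lock narrow avoids serializing the whole iteration, but as noted below it treats the symptom; if the underlying retrieval client is not safe to share at all, the real fix belongs there.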
@shroffk your patch works for me. I should mention #1274 here. This was noticed then, and it was assumed the web proxy had something to do with it. It seems a web proxy just makes this threading issue worse. I can recreate the lost data with a web proxy, and your code snippet fixes that too. Although, looking this over, it seems this could be cleaned up a bit more. Project? |
I think my synchronization is addressing a symptom as opposed to the actual problem...but at least we don't have missing data. |
@shroffk |
I am testing with postprocessors... the test I ran was with optimized_800()
I frequently notice that the databrowser widget embedded in an OPI fails to display all archived data. It's always the time just before switching to live data where data is missing.
I never notice this when I right-click the widget and open in the full data browser using the same .plt file. See attached screenshots of roughly the same time period. The OPI has just been opened, i.e. live data is only at the very right of the display.
The archiver we use is the archiver appliance.