
ClickHouse reads 40 million rows and writes them into MySQL; after about 35 million rows have been read, an error is reported. #1550

Open
T-M-L-C opened this issue Feb 23, 2024 · 17 comments
Labels
bug · investigating (Investigation of a root cause is ongoing)

Comments

T-M-L-C commented Feb 23, 2024

clickhouse-jdbc version: 0.4.6
error message: java.sql.SQLException: java.io.StreamCorruptedException: Reached end of input stream after reading 6719 of 10487 bytes

ClickHouse version: 23.3.2.1
error message: Code: 24. DB::Exception: Cannot write to ostream at offset 1082130432: While executing ParallelFormattingOutputFormat. (CANNOT_WRITE_TO_OSTREAM)
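
(The reporter's code is not shown; the failing pattern is presumably a streaming copy along these lines. This is a hypothetical sketch only: URLs, table and column names are invented. If the MySQL inserts stall the read loop longer than the server is willing to hold unsent data, the server aborts the response mid-stream, which would match both error messages above.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class CopyClickHouseToMySql {
    public static void main(String[] args) throws Exception {
        // Hypothetical connections; URLs and credentials are placeholders.
        try (Connection ch = DriverManager.getConnection("jdbc:clickhouse://ch-host:8123", "user", "pass");
             Connection my = DriverManager.getConnection("jdbc:mysql://mysql-host:3306/db", "user", "pass");
             Statement read = ch.createStatement()) {
            read.setFetchSize(100_000); // stream rows instead of buffering all 40M in memory
            try (ResultSet rs = read.executeQuery("SELECT id, payload FROM big_table");
                 PreparedStatement write = my.prepareStatement(
                         "INSERT INTO target (id, payload) VALUES (?, ?)")) {
                int inBatch = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setString(2, rs.getString(2));
                    write.addBatch();
                    if (++inBatch == 10_000) {
                        // If this batch insert stalls for too long, the ClickHouse
                        // side times out and the input stream breaks mid-read.
                        write.executeBatch();
                        inBatch = 0;
                    }
                }
                if (inBatch > 0) write.executeBatch();
            }
        }
    }
}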

T-M-L-C added the bug label Feb 23, 2024

T-M-L-C commented Feb 23, 2024

[screenshot attached]

chernser (Contributor) commented

Good day, @T-M-L-C !
Is the problem reproducible? Is it on the latest version, too?

Thanks in advance!

chernser added the investigating label Jun 18, 2024

TimonSP commented Jul 23, 2024

I'm seeing a similar issue, but it is intermittent; it occurs mostly when the client side is under pressure (several concurrent statements).

ClickHouse:

Code: 24. DB::Exception: Cannot write to ostream at offset 7340032: While executing ParallelFormattingOutputFormat. (CANNOT_WRITE_TO_OSTREAM) (version 23.8.14.6 (official build))

Client side:

Error reading from db: java.io.StreamCorruptedException: Reached end of input stream after reading 70390 of 93625 bytes

Any thoughts on what the reason could be, and how to avoid such failures?


chernser commented Jul 23, 2024

@TimonSP What are your client / server versions?
(removed internal link)


TimonSP commented Jul 23, 2024

@chernser, sorry, the link does not work for me (Page not found).

ClickHouse version 23.8.14.6
clickhouse-java version 0.4.6


wallacms commented Aug 21, 2024

I'm seeing this same issue using:
ClickHouse 23.9.1.1854
clickhouse-java version 0.6.3

Similar to @TimonSP, we see this intermittently but frequently. We are not able to reproduce it on demand, but we could turn on any debug logging that might be helpful. It always occurs when reading large numbers of records from ClickHouse (>10M). We see it both with the ClickHouse JDBC driver and with the native HTTP client.

One guess I had is that this could be related to TCP send/recv buffers. Is that a plausible explanation? How does ClickHouse handle the TCP send buffer being full?

Also, the link to the issue does not work for me either; I think that's because it's in the "clickhouse-private" repo. Can you give some more information about what that issue references?


TimonSP commented Sep 2, 2024

@chernser could you please copy-paste the ClickHouse issue from [internal link]
so we can check whether it applies to our case?


TimonSP commented Sep 2, 2024

I managed to reproduce the issue using Thread.sleep as a load simulation.
I'm using a simple query that generates 10M numbers, with two 25-second pauses after 1,000 and 2,000 rows.
The issue reproduces every time with such pauses (with shorter pauses it sometimes succeeds).

import com.clickhouse.jdbc.ClickHouseDataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

public class ConsoleTestException {
    public static void main(String[] args) {
        String user_name = "***";
        String user_password = "***";
        String url = "jdbc:clickhouse://***:8123";
        String query = "SELECT toString(cityHash64(number)) FROM numbers(10000000)";

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

        Properties properties = new Properties();

        ClickHouseDataSource dataSource;
        Connection connection;
        PreparedStatement statement;
        ResultSet rs;
        int idx = 1;
        try {
            dataSource = new ClickHouseDataSource(url, properties);
            connection = dataSource.getConnection(user_name, user_password);
            statement = connection.prepareStatement(query);
            rs = statement.executeQuery();
            while (rs.next()) {
                if (idx % 100 == 0){
                    System.out.println(String.format("[%s] %s", sdf.format(new Date()), idx));

                    if (idx == 1000 || idx == 2000){
                        try {
                            // simulate a slow consumer: stop reading for 25 s
                            Thread.sleep(25000);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        System.out.println(String.format("[%s] sleep completed", sdf.format(new Date())));
                    }
                }
                idx++;
            }
            System.out.println(String.format("[%s] total %s", sdf.format(new Date()), idx));
        } catch (SQLException e) {
            System.out.println(String.format("[%s] last %s", sdf.format(new Date()), idx));
            e.printStackTrace();
        }
    }
}

But I'm afraid it is a ClickHouse server issue; it is reproducible with the Python clickhouse_connect client too:

import clickhouse_connect
import time

if __name__ == '__main__':
    client = clickhouse_connect.get_client(
        host="***",
        port=8123,
        username="***",
        password="***"
    )
    with client.query_rows_stream('SELECT toString(cityHash64(number)) FROM numbers(10000000)') as stream:
        idx = 1
        for row in stream:
            if idx % 1000 == 0:
                print(idx)
            if idx == 1000 or idx == 2000:
                time.sleep(25)  # simulate a slow consumer
            idx += 1

Fails with clickhouse_connect.driver.exceptions.StreamFailureError


TimonSP commented Sep 3, 2024

The same code on 24.3.7.30 produced a different, clearer exception:

Code: 209. DB::NetException: Timeout exceeded while writing to socket ([::ffff:10.131.0.27]:55976, 30000 ms): While executing ParallelFormattingOutputFormat. (SOCKET_TIMEOUT) (version 24.3.7.30 (official build)) (from [::ffff:***]:55976) (in query: SELECT toString(cityHash64(number)) FROM numbers(10000000))

That helped me find what looks like the root cause of the issue: http_send_timeout. Note that it has to be set in the default profile, and a restart is apparently required; see ClickHouse/ClickHouse#64731 for details.

I have some trouble checking this myself, as I don't have access to change the default profile.
@wallacms, I hope it helps you.
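
For anyone who can change the server configuration: the setting apparently has to be raised in the server-side default profile (e.g. http_send_timeout under profiles/default in users.xml), not per query. To at least verify the value currently in effect, here is a minimal sketch that queries system.settings, reusing the placeholder URL and credentials from the repro above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckSendTimeout {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:clickhouse://***:8123"; // placeholder, as in the repro above
        try (Connection conn = DriverManager.getConnection(url, "***", "***");
             Statement stmt = conn.createStatement();
             // system.settings reports the value effective for the current session
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, value, changed FROM system.settings " +
                     "WHERE name = 'http_send_timeout'")) {
            while (rs.next()) {
                System.out.printf("%s = %s (changed: %s)%n",
                        rs.getString("name"), rs.getString("value"), rs.getString("changed"));
            }
        }
    }
}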


wallacms commented Sep 3, 2024

I can confirm that we fixed this problem by increasing the http_send_timeout in the default profile.


chernser commented Sep 3, 2024

@TimonSP @wallacms sorry, it is an internal link.
It is about memory allocation, but I don't think it is related here: if there were a memory issue, it would be reported on the server side.
There is a related PR, ClickHouse/ClickHouse#51732; the change was made recently, this year.

@wallacms

  • A 10M+ row result should not by itself be a problem; with a cloud instance the current Java client reads >1.6 GB datasets without issues.
  • You can raise the server-side logger level to debug and watch when the HTTP session is created. The server outputs "Logout" when the request is done; if anything goes wrong, the server will report it.


chernser commented Sep 3, 2024

@TimonSP
As I understand it, http_send_timeout should only be a problem for slow readers (clients). Here is the relevant CH code: https://github.com/ClickHouse/ClickHouse/blob/master/src/Server/HTTP/HTTPServerRequest.cpp, and here is the description of what a send/recv timeout means for a socket:

"Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred;"

I read this as: http_send_timeout defines how long the server will hold unsent data before giving up. Your example code with Thread.sleep emulates exactly that case. A timeout of 30 seconds is already quite large, so hitting it means the client is not receiving data at all for a long time.
@wallacms mentioned socket buffers; tuning them may help release the server connection sooner, but I don't think it would solve the problem.
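
To make the mechanism concrete: once the client stops reading, the client-side receive buffer and the server-side send buffer fill up, and the server's next write blocks; http_send_timeout bounds how long the server stays blocked. Below is a self-contained illustration using plain TCP sockets (nothing ClickHouse-specific; the 3-second watchdog plays the role of http_send_timeout):

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

public class BlockedSenderDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(0);
             Socket slowReader = new Socket("localhost", listener.getLocalPort());
             Socket sender = listener.accept()) {
            // slowReader never reads, so its receive buffer and the sender's
            // send buffer eventually fill up (a few MB may flow through first).
            AtomicLong lastProgress = new AtomicLong(System.currentTimeMillis());
            // Watchdog: emulates a server-side send timeout by closing the
            // connection when no write has completed for 3 seconds.
            Thread watchdog = new Thread(() -> {
                try {
                    while (System.currentTimeMillis() - lastProgress.get() <= 3000) {
                        Thread.sleep(500);
                    }
                    System.out.println("watchdog: no progress for 3 s, closing socket");
                    sender.close();
                } catch (Exception ignored) { }
            });
            watchdog.setDaemon(true);
            watchdog.start();

            OutputStream out = sender.getOutputStream();
            byte[] chunk = new byte[64 * 1024];
            long total = 0;
            try {
                while (true) {
                    out.write(chunk); // blocks once both TCP buffers are full
                    total += chunk.length;
                    lastProgress.set(System.currentTimeMillis());
                }
            } catch (Exception e) {
                // Analogous to CANNOT_WRITE_TO_OSTREAM / SOCKET_TIMEOUT on the server.
                System.out.println("send failed after " + total + " bytes: " + e);
            }
        }
    }
}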

Summary

  • Increasing the timeout may merely hide the problem.
  • We need to understand why the client/server is not moving data in each case:
    • Is it a GC pause?
    • Is it a lock somewhere?
    • Is it because reading from the source is slow?


wallacms commented Sep 3, 2024

@chernser We've observed this problem primarily in the context of accessing ClickHouse through Dremio. Since Dremio is a distributed execution engine, it can have long pauses in reading results from ClickHouse while it is servicing other queries or doing upstream processing on the results it gets from ClickHouse. Thus, in our case increasing the timeout is a reasonable thing to do, because we expect long pauses.
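
Where raising the server-side timeout is not an option, one generic client-side mitigation (a sketch, not something used in this thread; all names are illustrative) is to drain the ResultSet into a bounded queue as fast as rows arrive and run the slow downstream work on a separate thread, so the HTTP stream never sits idle longer than http_send_timeout:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DecoupledReader {
    private static final List<String> POISON = new ArrayList<>(); // end-of-stream marker

    // Drains rows quickly so the server never waits on us, while a worker
    // thread performs the slow per-batch processing (e.g. writing elsewhere).
    public static void drainAndProcess(ResultSet rs, RowBatchHandler handler)
            throws SQLException, InterruptedException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(16); // bounded: applies back-pressure
        Thread worker = new Thread(() -> {
            try {
                List<String> b;
                while ((b = queue.take()) != POISON) {
                    handler.handle(b); // slow work happens here, off the read path
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        worker.start();
        List<String> batch = new ArrayList<>(10_000);
        while (rs.next()) {
            batch.add(rs.getString(1));
            if (batch.size() == 10_000) {
                queue.put(batch); // blocks only if the worker is 16 batches behind
                batch = new ArrayList<>(10_000);
            }
        }
        if (!batch.isEmpty()) queue.put(batch);
        queue.put(POISON);
        worker.join();
    }

    public interface RowBatchHandler {
        void handle(List<String> batch) throws Exception;
    }
}

The bounded queue still applies back-pressure if the worker falls far behind, but it absorbs pauses of up to queue-capacity × batch-size rows without ever stalling the socket read.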


TimonSP commented Sep 4, 2024

@chernser
It looks like a normal case: there is no problem with reading or on the server side, but the data is being processed during the read.
In my case it is a BI tool that reads data from ClickHouse and stores it in its own storage, so that last step can take time.


chernser commented Sep 4, 2024

@wallacms @TimonSP thank you for sharing your use-cases.

@TimonSP how do you handle errors like:

  • writing to BI tool storage failed
  • reading from ClickHouse failed

Do you track which data is already synced, or do you repeat the whole request?


TimonSP commented Sep 4, 2024

@chernser the whole process repeats; it is a third-party tool for us, so we cannot control its behavior.

BTW, I've found that the default for this setting was decreased in 23.6: ClickHouse/ClickHouse#51171


laeg commented Nov 5, 2024

@TimonSP - which BI tool are you using?
