"failed to fill whole buffer" with multiple nodes #245

Closed
psachs opened this issue Mar 1, 2019 · 10 comments

Comments

psachs commented Mar 1, 2019

When configuring multiple Scylla nodes, inserts fail with the error "failed to fill whole buffer":

thread '<unnamed>' panicked at 'Failed to insert data: Io(Custom { kind: UnexpectedEof, error: StringError("failed to fill whole buffer") })', libcore/result.rs:1009:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:476
   5: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:390
   6: rust_begin_unwind
             at libstd/panicking.rs:325
   7: core::panicking::panic_fmt
             at libcore/panicking.rs:77
   8: core::result::unwrap_failed
             at libcore/macros.rs:26
   9: <core::result::Result<T, E>>::expect
             at libcore/result.rs:835
  10: <unknown>
             at src/main.rs:62

The issue can be reproduced with the following setup:

  1. Set up a Scylla cluster with 3 nodes:
docker run --name scylla-0 -p 9042:9042 -d scylladb/scylla
# wait until scylla-0 is up: docker exec -it scylla-0 nodetool status
docker run --name scylla-1 -p 9043:9042 -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla-0)"
docker run --name scylla-2 -p 9044:9042 -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla-0)"
  2. Create the column family within cqlsh: docker exec -it scylla-0 cqlsh
CREATE KEYSPACE test
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 2};

use test;

CREATE TABLE data (
    id int,
    epoch_utc timestamp,
    value double,
    PRIMARY KEY (id, epoch_utc))
WITH CLUSTERING ORDER BY (epoch_utc DESC)
AND COMPACTION = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};
  3. Cargo.toml
[package]
name = "rust_test_scylla"
version = "0.1.0"
authors = ["Pascal Sachs <pascal.sachs@sensirion.com>"]
edition = "2018"

[dependencies]
cdrs = "^2.0.0-beta.6"
cdrs_helpers_derive = "0.1.0"
rand = "^0.6.5"

[profile.dev]
panic = "abort"

[profile.release]
panic = "abort"
  4. main.rs
#[macro_use]
extern crate cdrs;
#[macro_use]
extern crate cdrs_helpers_derive;

use std::sync::Arc;
use std::thread;
use std::time::SystemTime;

use rand::prelude::*;

use cdrs::authenticators::NoneAuthenticator;
use cdrs::cluster::session::{new_lz4 as new_session, Session};
use cdrs::cluster::{ClusterTcpConfig, NodeTcpConfigBuilder, TcpConnectionPool};
use cdrs::frame::IntoBytes;
use cdrs::load_balancing::RoundRobinSync;
use cdrs::query::*;
use cdrs::types::prelude::*;

type ScyllaSession = Session<RoundRobinSync<TcpConnectionPool<NoneAuthenticator>>>;

const SCYLLA_NODES: &'static [&'static str] =
    &["localhost:9042", "localhost:9043", "localhost:9044"];

const INSERT_DATA: &'static str = "\
                                   INSERT INTO test.data \
                                   (id, epoch_utc, value) \
                                   VALUES (?, ?, ?)";

fn main() {
    let nodes = SCYLLA_NODES
        .into_iter()
        .map(|addr| NodeTcpConfigBuilder::new(addr, NoneAuthenticator {}).build())
        .collect();
    let cluster_config = ClusterTcpConfig(nodes);
    let session: Arc<ScyllaSession> = Arc::new(
        new_session(&cluster_config, RoundRobinSync::new())
            .expect("Could not connect to scylla cluster"),
    );

    for i in 0..20 {
        let thread_session = session.clone();
        thread::spawn(move || {
            let mut rng = rand::thread_rng();
            let query_insert_data = thread_session
                .prepare(INSERT_DATA)
                .expect("Failed to prepare insert data query");

            let id = i;
            let epoch_utc = (1_000
                * SystemTime::now()
                    .duration_since(SystemTime::UNIX_EPOCH)
                    .unwrap()
                    .as_secs()) as i64;
            let value = rng.gen();
            let values = DataStruct {
                id,
                epoch_utc,
                value,
            }
            .into_query_values();
            thread_session
                .exec_with_values(&query_insert_data, values)
                .expect("Failed to insert data");
        })
        .join()
        .expect("thread error");
    }
}

#[derive(Clone, Debug, IntoCDRSValue, PartialEq)]
pub struct DataStruct {
    id: i32,
    epoch_utc: i64,
    value: f64,
}

impl DataStruct {
    fn into_query_values(self) -> QueryValues {
        query_values!(self.id, self.epoch_utc, self.value)
    }
}
AlexPikalov commented Mar 1, 2019

Hello @psachs,

Thank you for creating this issue and providing the reproduction steps. I'll have a look at it today and keep you updated.

AlexPikalov commented Mar 1, 2019

@psachs
When you have some spare time, could you please try the v3 feature of CDRS? By default CDRS uses version 4 of the CQL binary protocol, but some previous versions of Scylla (I haven't checked the most recent one) used version 3.

[dependencies]
cdrs = { version = "^2.0.0-beta.6", default-features = false, features = ["v3"] }

psachs commented Mar 1, 2019

@AlexPikalov Thank you for the fast response. I updated the dependency as you requested, but it did not help with the issue.
The current Scylla seems to use v4:

[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]

AlexPikalov commented

@psachs Thanks for the feedback. Then, as I said, I hope to come up with something later today.

psachs commented Mar 1, 2019

@AlexPikalov Thank you very much for your support. Let me know if I can help you with anything.

AlexPikalov commented Mar 1, 2019

So far I was able to figure out what Io(Custom { kind: UnexpectedEof, error: StringError("failed to fill whole buffer") }) really means. The error frame parser had a bug, so it wasn't able to properly parse another error, which is Unprepared(UnpreparedError { id: CBytesShort { bytes: Some([241, 127, 142, 119, 197, 190, 59, 121, 10, 103, 147, 178, 235, 162, 184, 230]) } }) (No prepared statement with ID f17f8e77c5be3b790a6793b2eba2b8e6 found.). This means that a client tried to execute an unprepared query. To figure out what is wrong with the app you provided, I'll need to continue analyzing this problem.

As for the error frame parser bug, I've prepared a fix for it: #246

AlexPikalov commented Mar 2, 2019

@psachs

The problem with the unprepared statement is caused by the fact that each new request obtains a cluster node to which the request will be sent. In your code each thread makes two requests to the server: thread_session.prepare(INSERT_DATA) and thread_session.exec_with_values(&query_insert_data, values). Most probably there was a false assumption that both of these calls would be made to the same node. However, that's not true: in the case of a round-robin load balancer, if prepare is sent to node N, then the next request, exec_with_values, will be sent to node N + 1. But node N + 1 doesn't yet have a prepared query with the provided ID.
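
In terms of the reproduction code above, the routing looks roughly like this (a sketch; the node indices only illustrate the round-robin order):

// Inside each spawned thread, two independent requests go through the
// round-robin balancer:
let query_insert_data = thread_session
    .prepare(INSERT_DATA) // request 1: PREPARE is sent to node N
    .expect("Failed to prepare insert data query");
thread_session
    .exec_with_values(&query_insert_data, values) // request 2: EXECUTE goes to node N + 1
    .expect("Failed to insert data");
// Node N + 1 never received the PREPARE, so it has no statement with the
// given ID and answers with an Unprepared error.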

To make your flow work, the CDRS load balancer would need to be able to find the node that has the given query_insert_data query prepared and send the exec_with_values request exactly to that node. But currently CDRS doesn't have such a feature.

An option that would make your code work without waiting for an implementation of the new feature is to use query_with_values instead of the combination of thread_session.prepare(INSERT_DATA) and thread_session.exec_with_values(&query_insert_data, values). The main con of this approach is that the query string is sent with each request.
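
For reference, the change inside the spawned thread would then look roughly like this (a sketch based on the reproduction code above):

// Workaround sketch: a single request per insert. The query string is sent
// together with the values, so any node the balancer picks can execute it.
let values = DataStruct {
    id,
    epoch_utc,
    value,
}
.into_query_values();
thread_session
    .query_with_values(INSERT_DATA, values)
    .expect("Failed to insert data");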

psachs commented Mar 2, 2019

@AlexPikalov Thank you very much for the clarification. Yes, I can get rid of the prepared statements for now.
And of course I really appreciate your fast response and the quick fix. I'll test again on Monday and let you know if everything works out.

psachs commented Mar 4, 2019

@AlexPikalov
I just tested the master branch for around 2 hours with some realistic load and everything works like a charm without the prepared statements.

I therefore consider this bug fixed and am looking forward to the beta7 release ;-)

psachs closed this as completed Mar 4, 2019
AlexPikalov commented

@psachs

Thanks again for reporting this issue! Beta 7 with the fix for the unprepared error has been published.

As for the load balancer modification, I'll create an issue and try to come up with an approach to this problem.
