-
-
Notifications
You must be signed in to change notification settings - Fork 219
Fix querying and creation of a large matrix due to integer overflow #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the PR, that looks interesting. Would you be able to supply an example where the old code break, and ideally where the new code passes? Even better, could you add that as a unit test? |
I think https://github.com/wch/r-source/blob/trunk/src/main/array.c#L202-#L213 |
Actually, I think that almost all occurences of 'int' and 'size_t' should be replaced by 'R_xlen_t' which is a typedef of 'ptr_diff_t' that "acts as the signed counterpart of std::size_t" (cf: RInternals.h) |
In my opinion, if So in R's own source code, https://github.com/wch/r-source/blob/trunk/src/main/array.c#L202-#L213 |
You are right. |
This code doesn't produce the expected result. #include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix test() {
const int nrow = 4000000;
const int ncol = 700;
NumericMatrix v(nrow,ncol);
return v;
}
/*** R
test()
*/
|
I agree that changing from E.g. https://github.com/wch/r-source/blob/trunk/src/main/array.c#L148-L151 |
Quick check: R> 4000000 * 700 * 8 / 1e9
[1] 22.4
R> 22 gb of RAM for a single matrix is not all that common, but not impossible. And on board with the rest of the discussion: we ought to have the proper interface to R. |
This code probably produces a segfault but I will not have an access to a machine with enought RAM to test it until tomorrow. #include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void test(const NumericMatrix& m) {
Rcout << m(m.nrow()-1,m.ncol()-1) << std::endl;
}
/*** R
m=matrix(rep(1.1,4000000*700), nrow=4000000,ncol=700)
test(m)
*/ |
For your information, a lot laboratories are using R to analyse matrices with much more than 2^31-1 elements |
The problem is in the dimension class: https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/Dimension.h#L59-L61
|
Nicely done, KK. Agree with the issue in Dimension.h. And yes, the code does segfault (on our largest server with lots of ram): R> cppFunction("void mattest(NumericMatrix m) { Rcout << m(m.nrow()-1,m.ncol()-1) << std::endl; }")
R> m <- matrix(1.1, nrow=4000000,ncol=700)
R> mattest(m)
*** caught segfault ***
address 0x7f96c8671040, cause 'memory not mapped'
Traceback:
1: .Primitive(".Call")(<pointer: 0x7f9ec8675c40>, m)
2: mattest(m)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: |
This constructor is also problematic template <typename Iterator>
Matrix( const int& nrows_, const int& ncols, Iterator start ) :
VECTOR( start, start + (nrows_*ncols) ), // Potential overflow. Need to cast to R_xlen_t
nrows(nrows_)
{
VECTOR::attr( "dim" ) = Dimension( nrows, ncols ) ;
} |
So can you update your PR according to our discussion? @FPlaza |
What about those operators? inline Proxy operator()( const size_t& i, const size_t& j) {
return static_cast< Vector<RTYPE>* >( this )->operator[]( offset( i, j ) ) ;
}
inline const_Proxy operator()( const size_t& i, const size_t& j) const {
return static_cast< const Vector<RTYPE>* >( this )->operator[]( offset( i, j ) ) ;
} |
They should be https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/vector/Matrix.h#L160 |
I am not very familiar with git. |
Change your file. Commit it again. |
@@ -58,7 +58,7 @@ class Matrix : public Vector<RTYPE, StoragePolicy>, public MatrixBase<RTYPE, tru | |||
|
|||
template <typename Iterator> | |||
Matrix( const int& nrows_, const int& ncols, Iterator start ) : | |||
VECTOR( start, start + (nrows_*ncols) ), | |||
VECTOR( start, start + (static_cast<R_xlen_t>(nrows)_*ncols) ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a typo here, static_cast<R_xlen_t>(nrows_)
@@ -58,7 +58,7 @@ class Matrix : public Vector<RTYPE, StoragePolicy>, public MatrixBase<RTYPE, tru | |||
|
|||
template <typename Iterator> | |||
Matrix( const int& nrows_, const int& ncols, Iterator start ) : | |||
VECTOR( start, start + (nrows_*ncols) ), | |||
VECTOR( start, start + (static_cast<R_xlen_t>(nrows_)*ncols) ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather VECTOR( start, start + (static_cast<R_xlen_t>(nrows_ * ncols)))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to cast before multiply, otherwise, there is still an overflow, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh. Good point.
Do the changes satisfy you? |
@@ -39,14 +39,14 @@ namespace Rcpp{ | |||
dims = other.dims ; | |||
return *this ; | |||
} | |||
Dimension(const size_t& n1) : dims(1){ | |||
Dimension(const int& n1) : dims(1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But int
is signed and size_t
is not. I think you just reduced the range of numbers which can be expressed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, dims is an 'int' vector so 'size_t' will be downcasted to 'int' at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not that lighthearted about changing existing interfaces. Right now I am inclined not to take the patch. Lines 59 and 60 in Dimension.h
look fine as does line 61 in Matrix.h
. I am not so sure about the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patches make the Rcpp interface coherent with the RInternals (which is currently not the case.)
Tell me what are your concerns and I will answer it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For coherence with R's interface, I think it's best to use int
here. The std::size_t
items used would still get casted to int
later which of course would cause problems if we had a std::size_t
which did not fit in int
.
That said, this is a change that could potentially break ABI and the ultimate effect is 'correctness' but nothing really user-visible unless one attempts to construct a dimension that doesn't fit in an int
so it's a bit more difficult to accept.
I think it needs more discussion.
|
I am closing this PR. |
No description provided.