Hi,
I am having memory issues with data.table on big data. By big data, I mean roughly 200K rows and 12K columns. Below is an example with very small data that shows the steps I am doing, and I point out where those steps fail with my actual data. If you could verify that I am using data.table in the optimal way, or suggest a better way if not, that would be very much appreciated!
I am using data.table (version 1.9.6) on R 3.2.2 (64-bit), on a 64-bit Windows 7 machine with 16 GB of RAM.
Example:
1) I have three data.tables: Dt1, Dt2, and Dt3:
library(data.table)  # needed for data.table(), setkey(), dcast.data.table()

RowId    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  # unique values
Location = c("Chicago", "Amsterdam", "New York", "LA", "Chicago",
             "California", "Florida", "New York", "New Jersey", "LA")
Dt1 <- data.table(RowId, Location)
setkey(Dt1, "RowId")

RowId      = c(1, 1, 3, 4, 4)
DayOfMonth = c("Feb", "Mar", "Apr", "Mar", "May")
Dt2 <- data.table(RowId, DayOfMonth)

RowId     = c(2, 3, 6, 1, 6, 8)
DayOfWeek = c("Mon", "Fri", "Wed", "Wed", "Tue", "Mon")
Dt3 <- data.table(RowId, DayOfWeek)
2) I need to create one table from the above three tables, with the RowId column having unique values, so that the result can be fed as input to classification methods such as rpart (decision-tree modeling).
2a) I therefore use dcast.data.table on Dt2 and Dt3 so that each gets a RowId column with unique values:
Dt2Wide <- dcast.data.table(Dt2, RowId ~ DayOfMonth, fun.aggregate = function(x) TRUE, fill = FALSE)
Dt3Wide <- dcast.data.table(Dt3, RowId ~ DayOfWeek,  fun.aggregate = function(x) TRUE, fill = FALSE)
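For the toy data, Dt2Wide should look roughly like this (one row per unique RowId that appears in Dt2, one logical column per unique DayOfMonth value, in alphabetical order):

#    RowId   Apr   Feb   Mar   May
# 1:     1 FALSE  TRUE  TRUE FALSE
# 2:     3  TRUE FALSE FALSE FALSE
# 3:     4 FALSE FALSE  TRUE  TRUE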
2b) Then I do a LEFT JOIN of Dt1 with Dt3Wide and Dt2Wide (so I get all the RowId values from Dt1):
temp <- Dt3Wide[Dt2Wide[Dt1]]
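(If it helps, I believe the chained join above is equivalent in content to two explicit left merges; the column order differs, but the rows and values should be the same:)

# Equivalent formulation with merge(), keeping every RowId from Dt1
temp <- merge(Dt1,  Dt2Wide, by = "RowId", all.x = TRUE)
temp <- merge(temp, Dt3Wide, by = "RowId", all.x = TRUE)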
3) I then need to replace every NA introduced by the LEFT JOIN in step 2b with FALSE:
temp <- replace(temp, is.na(temp), FALSE)
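As an aside, since replace() appears to make a full copy of temp, I have also been experimenting with replacing the NAs by reference, column by column, with set() (just a sketch; I am not sure it is the best way):

# Replace NA with FALSE in place, one column at a time, without copying temp
for (j in seq_along(temp)) {
  whichNa <- which(is.na(temp[[j]]))
  if (length(whichNa)) set(temp, i = whichNa, j = j, value = FALSE)
}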
Problems:
In step 2a, when my actual data has 200K rows and produces 5K and 7K wide columns (from the unique values of Dt2::DayOfMonth and Dt3::DayOfWeek, respectively), the space allocated for Dt2Wide and Dt3Wide comes to about 4 GB and 5 GB (even though Dt2 and Dt3 themselves are only 10-14 MB). I can create the 4 GB Dt2Wide, but I get a memory allocation error for the 5 GB Dt3Wide (with 16 GB of RAM):
Error: cannot allocate vector of size 5.0 Gb
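The sizes make sense to me when I do the arithmetic: R stores each logical in 4 bytes, so the dense wide tables need roughly:

# Back-of-the-envelope sizes (4 bytes per logical cell)
200000 * 5000 * 4 / 2^30   # Dt2Wide: ~3.7 GiB
200000 * 7000 * 4 / 2^30   # Dt3Wide: ~5.2 GiB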
Furthermore, step 2b (merging all three tables) produces an object of about 10 GB, which again triggers a memory allocation error during step 3 (the NA replacement).
So is there a way to get around this problem when the data has 200K or more rows, and 12K or more columns?
I have tried using ffdf and merge.ffdf, but merge.ffdf is much slower than data.table's merge, and replacing NAs in a data.table is also far faster than in an ffdf object. So, if possible, I would really like to stick with data.table for my application.
Thanks,
Supriya