noctua 2.6.0
Feature:
- Add `clear_s3_resource` parameter to `noctua_options` to prevent the AWS Athena output AWS S3 resource being cleared up by `dbClearResult` (RAthena: # 168). Thanks to @juhoautio for the request.
- Support extra `paws` parameters (RAthena: # 169).
- Support `endpoint_override` parameter, allowing the default endpoint for each service to be overridden accordingly (RAthena: # 169). Thanks to @aoyh for the request and for checking the package in development.
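A minimal sketch of how these new options might be used together. The parameter names come from the entries above; the endpoint URL is illustrative only:

```r
library(DBI)
library(noctua)

# keep the AWS Athena output in AWS S3 after dbClearResult is called
noctua_options(clear_s3_resource = FALSE)

# override the default AWS Athena endpoint (illustrative URL)
con <- dbConnect(noctua::athena(),
                 endpoint_override = "https://athena.eu-west-1.amazonaws.com")
```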
noctua 2.5.1
Bug Fix:
- Fixed unit test helper function `test_data` to use the `size` parameter explicitly.
noctua 2.5.0
Feature:
- Allow all information messages to be turned off (#178)
- Allow `noctua_options` to change one parameter at a time without affecting other pre-configured settings.
- Return warning message for deprecated `retry_quiet` parameter in the `noctua_options` function.
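A sketch of changing one setting at a time without disturbing the others (assuming the `cache_size` and `file_parser` settings, which appear elsewhere in this changelog):

```r
library(noctua)

# enable caching of the last 10 queries
noctua_options(cache_size = 10)

# later, change only the file parser; cache_size remains 10
noctua_options(file_parser = "vroom")
```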
noctua 2.4.0
Feature:
- Add support for the `dbplyr` 2.0.0 backend API.
- Add method to set unload at a package level to allow `dplyr` to benefit from `AWS Athena unload` methods (#174).
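A sketch of enabling unload at the package level so `dplyr` verbs benefit as well (assuming `noctua_options` carries the `unload` switch described in #174):

```r
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# enable AWS Athena unload for all subsequent queries, including dplyr ones
noctua::noctua_options(unload = TRUE)

tbl(con, "iris") %>%
  collect()
```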
Bug Fix:
- Ensure `dbGetQuery`, `dbExecute`, `dbSendQuery`, `dbSendStatement` work on older versions of `R` (#170). Thanks to @tyner for identifying the issue.
- Caching would fail when the statement wasn't a character (#171). Thanks to @ramnathv for identifying the issue.
noctua 2.3.0
Feature:
- Add support for `AWS Athena UNLOAD` (#160). This is to take advantage of the read/write speed `parquet` has to offer.
```python
import awswrangler as wr
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
)

wr.catalog.table(database="awswrangler_test", table="noaa")
```

```r
library(DBI)

con <- dbConnect(noctua::athena())

# Query ran using CSV output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  57.004   8.430 160.567

noctua::noctua_options(cache_size = 1)

# Query ran using UNLOAD Parquet output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  21.622   2.350  39.232

# Query ran using cache
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  13.738   1.886  11.029
```

noctua 2.2.0
Bug Fix:
- `sql_translate_env` correctly translates the R functions `quantile` and `median` to their `AWS Athena` equivalents (#153). Thanks to @ellmanj for spotting the issue.
Feature:
- Support `AWS Athena` `timestamp with time zone` data type.
- Properly support data type `list` when converting data to `AWS Athena` `SQL` format.
```r
library(data.table)
library(DBI)

x = 5

dt = data.table(
  var1 = sample(LETTERS, size = x, T),
  var2 = rep(list(list("var3" = 1:3, "var4" = list("var5" = letters[1:5]))), x)
)

con <- dbConnect(noctua::athena())

#> Version: 2.2.0
sqlData(con, dt)
# Registered S3 method overwritten by 'jsonify':
#   method     from
#   print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                                   var2
# 1:    1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2:    2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3:    3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4:    4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5:    5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}

#> Version: 2.1.0
sqlData(con, dt)
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                        var2
# 1:    1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2:    2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3:    3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4:    4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5:    5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
```

Version 2.2.0 now converts lists into JSON lines format so that AWS Athena can parse them with SQL array/mapping/json functions. A small downside is that an S3 method conflict occurs when `jsonify` is called to convert lists into JSON lines. `jsonify` was chosen over `jsonlite` due to its performance improvements (#156).
noctua 2.1.0
Bug Fix:
- `dbIsValid` wrongly stated the connection is valid for the result class when the connection class was disconnected.
- `sql_translate_env.paste` broke with the latest version of `dbplyr`. The new method is compatible with `dbplyr` >= 1.4.3 (#149).
Feature:
- `sql_translate_env`: add support for `stringr`/`lubridate` style functions, similar to the Postgres backend.
- `write_bin` now doesn't chunk `writeBin` if the R version is greater than 4.0.0 (HenrikBengtsson/Wishlist-for-R# 97) (#149).
- `dbConnect`: add `timezone` parameter so that the time zone between `R` and `AWS Athena` is consistent.
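A sketch of the new `timezone` parameter (time zone value illustrative):

```r
library(DBI)

# return POSIXct columns in a consistent time zone between R and AWS Athena
con <- dbConnect(noctua::athena(),
                 timezone = "Europe/London")
```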
noctua 2.0.1
Bug Fix:
- Fix issue of keyboard interrupt failing to raise interrupt error.
noctua 2.0.0
API Change
- `AthenaConnection` class: `ptr` and `info` slots changed from `list` to `environment` within the `AthenaConnection` class. This allows the class to be updated by reference and simplifies notation when viewing the class from the RStudio environment tab.
- `AthenaResult` class: `info` slot changed from `list` to `environment`. This allows the class to be updated by reference.
By utilising environments for `AthenaConnection` and `AthenaResult`, all `AthenaResult` classes created from an `AthenaConnection` will point to the same `ptr` and `info` environments for its connection. Previously `ptr` and `info` would be copied, meaning that if one was modified it would not affect the child or parent class, for example:
```r
# Old Method
library(DBI)

con <- dbConnect(noctua::athena(),
                 rstudio_conn_tab = F)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"
# nothing happened
res@connection@info$made_up
# > NULL

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"
# nothing happened
con@info$made_up
# > "helloworld"
```

```r
# New Method
library(DBI)

con <- dbConnect(noctua::athena(),
                 rstudio_conn_tab = F)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"
# picked up change
res@connection@info$made_up
# > "helloworld"

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"
# picked up change
con@info$made_up
# > "oh no!"
```

New Feature
- Added support for `AWS Athena` data types [`array`, `row`, `map`, `json`, `binary`, `ipaddress`] (#135). Conversion types can be changed through `dbConnect` and `noctua_options`.
```r
library(DBI)
library(noctua)

# default conversion methods
con <- dbConnect(noctua::athena())

# change json conversion method
noctua_options(json = "character")
noctua:::athena_option_env$json
# [1] "character"

# change json conversion to custom method
noctua_options(json = jsonify::from_json)
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

# change bigint conversion without affecting custom json conversion methods
noctua_options(bigint = "numeric")
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

noctua:::athena_option_env$bigint
# [1] "numeric"

# change binary conversion without affecting bigint or json methods
noctua_options(binary = "character")
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

noctua:::athena_option_env$bigint
# [1] "numeric"

noctua:::athena_option_env$binary
# [1] "character"

# no conversion for json objects
con2 <- dbConnect(noctua::athena(), json = "character")

# use custom json parser
con <- dbConnect(noctua::athena(), json = jsonify::from_json)
```

- Allow users to turn off the RStudio Connections tab when working in RStudio (#136). This can be done through the parameter `rstudio_conn_tab` within `dbConnect`.
Bug Fix
- `AWS Athena` uses the `float` data type for DDL only; `noctua` was wrongly parsing the `float` data type back to R. Instead `AWS Athena` uses the data type `real` in SQL functions like `select cast` (https://docs.aws.amazon.com/athena/latest/ug/data-types.html). `noctua` now correctly parses `real` to R's data type `double` (#133).
- Iterate through each token `AWS` returns to get all results from the `AWS Glue` catalogue (#137).
noctua 1.10.0
New Feature
- Added optional formatting to `dbGetPartition`. This simply tidies up the default AWS Athena partition format.
```r
library(DBI)
library(noctua)

con <- dbConnect(athena())

dbGetPartition(con, "test_df2", .format = T)
# Info: (Data scanned: 0 Bytes)
#    year month day
# 1: 2020    11  17

dbGetPartition(con, "test_df2")
# Info: (Data scanned: 0 Bytes)
#                    partition
# 1: year=2020/month=11/day=17
```

- Support different formats for returning `bigint`, to align with other DBI interfaces i.e. `RPostgres`. Now `bigint` can be returned in the following formats: ["integer64", "integer", "numeric", "character"]
```r
library(DBI)

con <- dbConnect(noctua::athena(), bigint = "numeric")
```

When switching between the different file parsers, the `bigint` representation changes according to the file parser, i.e. `data.table`: "integer64" -> `vroom`: "I".
Bug Fix
- `dbRemoveTable`: check if the key has "." or ends with "/" before adding "/" to the end (#125)
- Added `uuid` minimum version to fix issue (#128)
Documentation
- Added note to `dbRemoveTable` doc string around AWS Athena table Location in Amazon S3.
noctua 1.9.1
Note
- Added package checks to unit tests when testing a suggested dependency. This is to fix "CRAN Package Check Results for Package noctua" for operating system "r-patched-solaris-x86". Error message:
```
Error: write_parquet requires the arrow package, please install it first and try again
```
Bug Fix
- `dbRemoveTable` would error if the AWS S3 files for the Athena table had been removed:

```
Error in seq.default(1, length(l), 1000) : wrong sign in 'by' argument
```

Now a warning message will be returned:

```
Warning message:
Failed to remove AWS S3 files from: "s3://{bucket}/{prefix}/". Please check if AWS S3 files exist.
```
noctua 1.9.0
Minor Change
- `dbRemoveTable` now removes AWS S3 objects using `delete_objects` instead of `delete_object`. This allows `noctua` to delete AWS S3 files in batches, reducing the number of API calls to AWS and improving performance.
```r
library(DBI)
library(data.table)

X <- 1010

value <- data.table(x = 1:X,
                    y = sample(letters, X, replace = T),
                    z = sample(c(TRUE, FALSE), X, replace = T))

con <- dbConnect(noctua::athena())

# create a removable table with 1010 parquet files in AWS S3.
dbWriteTable(con, "rm_tbl", value, file.type = "parquet", overwrite = T, max.batch = 1)

# old method: delete_object
system.time({dbRemoveTable(con, "rm_tbl", confirm = T)})
#    user  system elapsed
#  31.004   8.152 115.906

# new method: delete_objects
system.time({dbRemoveTable(con, "rm_tbl", confirm = T)})
#    user  system elapsed
#  17.319   0.370  22.709
```

New Feature
- Move `sql_escape_date` into the `dplyr_integration.R` backend (RAthena: # 121).
- Allow `noctua` to append to a static AWS S3 location using `uuid`.
Bug Fix
- Parquet `file.type`s now use the parameter `use_deprecated_int96_timestamps` set to `TRUE`. This puts the POSIXct data type into a `java.sql.Timestamp` compatible format, such as `yyyy-MM-dd HH:mm:ss[.f...]`. Thanks to Christian N Wolz for highlighting this issue.
- When more than 1000 files exist in the backend of an Athena table, `dbRemoveTable` would ask the user twice to confirm if they wish to remove the backend files:
```
Info: The S3 objects in prefix will be deleted:
s3://bucket/path/schema/table
Info: The S3 objects in prefix will be deleted:
s3://bucket/path/schema/table
```
To overcome this, `dbRemoveTable` now opts for `paws::s3()$list_objects_v2` instead of `paws::s3()$list_objects` when listing S3 objects to be deleted. This allows `noctua` to iterate over the AWS S3 prefix using tokens, instead of deleting objects in chunks.
- `s3_upload_location` simplified how the S3 location is built. Now the `s3.location` parameter isn't affected, and instead only additional components are appended, e.g. name, schema and partition.
- `dbplyr` v-2.0.0 function `in_schema` now wraps strings in quotes; this broke `db_query_fields.AthenaConnection`. Now `db_query_fields.AthenaConnection` removes any quotation from the string so that it can search `AWS Glue` for table metadata. (#117)
noctua 1.8.1
Bug Fix
- Pass a uuid string to the `start_query_execution` parameter `ClientRequestToken`, so that the `ClientRequestToken` is "A unique case-sensitive string used to ensure the request to create the query is idempotent (executes only once)." (#104)
- Allowed `cache_size` to equal 100.
- Do not abort if a `glue::get_tables` api call fails (e.g., due to missing permissions to a specific database or an orphaned Lake Formation resource link) when retrieving a list of database tables with `dbListTables`, `dbGetTables` or in RStudio's Connections pane.
noctua 1.8.0
New Feature
- `noctua` now supports keyboard interrupt and will stop AWS Athena running the query when the query has been interrupted. To keep AWS Athena running when `R` has been interrupted, a new parameter has been added to `dbConnect`: `keyboard_interrupt`. Example:
```r
# Stop AWS Athena when R has been interrupted:
con <- dbConnect(noctua::athena())

# Let AWS Athena keep running when R has been interrupted:
con <- dbConnect(noctua::athena(),
                 keyboard_interrupt = F)
```

- Added "AWS_DEFAULT_REGION" to supported environmental variables: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables
noctua 1.7.1
Minor Change
- Fixed issue where `noctua` would return a `data.frame` for utility `SQL` queries regardless of the backend file parser. This is due to `AWS Athena` outputting `SQL UTILITY` queries as a text file that is required to be read in line by line. Now `noctua` will return the correct data format based on the file parser set in `noctua_options`; for example, `noctua_options("vroom")` will return `tibble`s.
Documentation:
- Added documentation to highlight the behaviour of `dbClearResult` when the user doesn't have permission to delete AWS S3 objects (#96)
noctua 1.7.0
New Feature
- Functions that collect or push to AWS S3 now have a retry capability, meaning if an API call fails then the call is retried (#79). `noctua_options` contains 2 new parameters to control how `noctua` handles retries.
- `dbFetch` is able to return data from AWS Athena in chunks. This has been achieved by passing `NextToken` to the `AthenaResult` S4 class. This method won't be as fast as `n = -1`, as each chunk has to be processed into data frame format.
```r
library(DBI)

con <- dbConnect(noctua::athena())

res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
```

- When creating/appending partitions to a table, `dbWriteTable` opts to use `alter table` instead of the standard `msck repair table`. This is to improve performance when appending to tables with a high number of existing partitions.
- `dbWriteTable` now allows json to be appended to json DDLs created with the Openx-JsonSerDe library.
- `dbConvertTable` brings `dplyr::compute` functionality to the base package, allowing `noctua` to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (RAthena: # 37).
- Extended `dplyr::compute` to give the same functionality as `dbConvertTable`.
- Added `region_name` check before making a connection to AWS Athena (RAthena: # 110).
Bug Fix
- `dbWriteTable` would throw a `throttling error` every now and again; `retry_api_call` has been built to handle the parsing of data between R and AWS S3.
- `dbWriteTable` did not clear down all metadata when uploading to `AWS Athena`.
Documentation
- `dbWriteTable` added support for DDL structures for users who have created DDLs outside of `noctua`.
- Added a vignette around how to use `noctua` retry functionality.
- Moved all examples requiring credentials to `\dontrun` (#91).
noctua 1.6.0
New Feature
- Inspired by `pyathena`, `noctua_options` now has a new parameter `cache_size`. This implements local caching in R environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing S3's Athena output when caching isn't disabled.
- `noctua_options` now has a `clear_cache` parameter to clear down all cached data.
- `dbRemoveTable` now utilises `AWS Glue` to remove tables from the `AWS Glue` catalog. This has a performance enhancement:
```r
library(DBI)

con = dbConnect(noctua::athena())

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = T))
#    user  system elapsed
#   0.247   0.091   2.243

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = T))
#    user  system elapsed
#   0.110   0.045   1.094
```

- `dbWriteTable` now supports uploading json lines (http://jsonlines.org/) format up to `AWS Athena` (#88).
```r
library(DBI)

con = dbConnect(noctua::athena())

dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
```

Bug Fix
- `dbConnect` didn't correctly pass `.internal` metadata for `paws` objects.
- RStudio connection tab functions `computeHostName` & `computeDisplayName` now get the region name from the `info` object of the `dbConnect` S4 class.
- `dbWriteTable`: when appending to an existing table, the compress file type was incorrectly returned.
- The RStudio connection tab ran into an issue when the Glue table wasn't stored correctly (RAthena: # 92)
Documentation
- Added supported environmental variable `AWS_REGION` into `dbConnect`
- Vignettes added:
- AWS Athena Query Cache
- AWS S3 backend
- Changing Backend File Parser
- Getting Started
Unit tests:
- Increase coverage to + 80%
noctua 1.5.1
Bug Fix
- `writeBin`: only 2^31 - 1 bytes can be written in a single call (and that is the maximum capacity of a raw vector on 32-bit platforms). This means that it will error out with large raw connections. To overcome this, `writeBin` can be called in chunks. If `readr` is available on the system then `readr::write_file` is used for extra speed.
```r
library(readr)
library(microbenchmark)

# creating some dummy data for testing
X <- 1e8
df <-
  data.frame(
    w = runif(X),
    x = 1:X,
    y = sample(letters, X, replace = T),
    z = sample(c(TRUE, FALSE), X, replace = T))
write_csv(df, "test.csv")

# read in text file into raw format
obj <- readBin("test.csv", what = "raw", n = file.size("test.csv"))
format(object.size(obj), units = "auto")
# 3.3 Gb

# writeBin in a loop
write_bin <- function(
  value,
  filename,
  chunk_size = 2L ^ 20L) {
  total_size <- length(value)
  split_vec <- seq(1, total_size, chunk_size)

  con <- file(filename, "a+b")
  on.exit(close(con))

  sapply(split_vec, function(x) {
    writeBin(value[x:min(total_size, (x + chunk_size - 1))], con)
  })
  invisible(TRUE)
}

microbenchmark(writeBin_loop = write_bin(obj, tempfile()),
               readr = write_file(obj, tempfile()),
               times = 5)
# Unit: seconds
#    expr       min       lq      mean    median        uq       max neval
#  R_loop 41.463273 41.62077 42.265778 41.908908 42.022042 44.313893     5
#   readr  2.291571  2.40495  2.496871  2.542544  2.558367  2.686921     5
```

- Thanks to @OssiLehtinen for fixing date variables being incorrectly translated by `sql_translate_env` (RAthena: # 44)
```r
# Before
translate_sql("2019-01-01", con = con)
# '2019-01-01'

# Now
translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
```

- Dependency `data.table` is now restricted to (>= 1.12.4) due to file compression being added to `fwrite` (>= 1.12.4): https://github.com/Rdatatable/data.table/blob/master/NEWS.md
- R functions `paste`/`paste0` would use the default `dplyr` `sql-translate-env` (`concat_ws`). `paste0` now uses Presto's `concat` function and `paste` now uses pipes to get extra flexibility for custom separating values.
```r
# R code:
paste("hi", "bye", sep = "-")

# SQL translation:
# ('hi'||'-'||'bye')
```

- If a table exists and the parameter `append` is set to `TRUE`, then the existing `s3.location` will be utilised (RAthena: # 73).
- `db_compute` returned the table name, however when a user wished to write the table to another location (RAthena: # 74) an error would be raised: `Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist`. This has now been fixed, with `db_compute` returning `dbplyr::in_schema`.
```r
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

tbl(con, "iris") %>%
  compute(name = "temp.iris")
```

- `dbListFields` didn't display partitioned columns. This has now been fixed, with the call to AWS Glue being altered to include more metadata, allowing column names and partitions to be returned.
- The RStudio Connections tab didn't display any partitioned columns; this has been fixed in the same manner as `dbListFields`.
New Feature
- `dbStatistics` is a wrapper around `paws` `get_query_execution` to return statistics for `noctua::dbSendQuery` results.
- `dbGetQuery` has a new parameter `statistics` to print out `dbStatistics` before returning Athena results.
- `noctua_options` now checks if the desired file parser is installed before changing the file parser method.
- File parser `vroom` has been restricted to >= 1.2.0 due to integer64 support and changes to the `vroom` api.
- Thanks to @OssiLehtinen for improving the speed of `dplyr::tbl` when calling Athena using the ident method (#64):
```r
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

# ident method:
t1 <- system.time(tbl(con, "iris"))

# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))

# ident method
#    user  system elapsed
#   0.082   0.012   0.288

# sub query method
#    user  system elapsed
#   0.993   0.138   3.660
```

Unit test
- `dplyr` `sql_translate_env`: expected results have now been updated to take into account the bug fix with date fields.
- S3 upload location: test if the created S3 location is in the correct location.
noctua 1.5.0
New Feature
- Added integration into the RStudio Connections tab
- Added information message on the amount of data scanned by AWS Athena
- Added method to change the backend file parser, so users can switch from `data.table` to `vroom`. From now on it is possible to change the file parser using `noctua_options`, for example:

```r
library(noctua)

noctua_options("vroom")
```

- New function `dbGetTables` that returns the Athena hierarchy as a data.frame.
Unit tests
- Added data transfer unit test for backend file parser `vroom`
Documentation
- Updated R documentation to `roxygen2` 7.0.2
noctua 1.4.0
Major Change
- Default delimited file uploaded to AWS Athena changed from "csv" to "tsv"; this is due to the separating value "," appearing in character variables. By using the "tsv" file type, JSON/Array objects can be passed to Athena through character types. To prevent this becoming a breaking change, the `dbWriteTable` `append` parameter checks and uses the existing AWS Athena DDL file type. If `file.type` doesn't match the Athena DDL file type then the user will receive a warning message:

```r
warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type, '".', call. = FALSE)
```

Bug fix
- Due to an issue highlighted by @OssiLehtinen in (RAthena: # 50), special characters had issues being processed when using a flat file in the backend.
- Fixed issue where row.names were not being correctly catered for, returning NA in column names (RAthena: # 41)
- Fixed issue with `INTEGER` being incorrectly translated in `sql_translate_env.R`
- Fixed issue where `as.character` was being wrongly translated (RAthena: # 45)
Unit Tests
- Special characters have been added to the `data-transfer` unit test
- `dbRemoveTable` new parameters are added in the unit test
- Added row.names to the data transfer unit test
- Updated the dplyr `sql_translate_env` unit test to cater for the bug fix
New Feature
- Due to help from @OssiLehtinen, `dbRemoveTable` can now remove the S3 files for the AWS Athena table being removed.
Minor Change
- Added AWS_ATHENA_WORK_GROUP environmental variable support
- Removed `tolower` conversion due to request (RAthena: # 41)
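The work group environment variable support above can be sketched as follows (work group name illustrative):

```r
# set before connecting; dbConnect picks the work group up from the environment
Sys.setenv(AWS_ATHENA_WORK_GROUP = "primary")

library(DBI)
con <- dbConnect(noctua::athena())
```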
noctua 1.3.0
Major Change
- `dbWriteTable` will now split `gzip` compressed files to improve AWS Athena performance. By default `gzip` compressed files will be split into 20.
Performance results
```r
library(DBI)

X <- 1e8

df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))

con <- dbConnect(noctua::athena())

# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = T, max.batch = nrow(df), overwrite = T) # no splits
dbWriteTable(con, "test_split2", df, compress = T, max.batch = 0.05 * nrow(df), overwrite = T) # 20 splits
dbWriteTable(con, "test_split3", df, compress = T, max.batch = 0.1 * nrow(df), overwrite = T) # 10 splits
```

AWS Athena performance results from the AWS console (query executed: `select count(*) from ....`):
- test_split1: (Run time: 38.4 seconds, Data scanned: 1.16 GB)
- test_split2: (Run time: 3.73 seconds, Data scanned: 1.16 GB)
- test_split3: (Run time: 5.47 seconds, Data scanned: 1.16 GB)
```r
library(DBI)

X <- 1e8

df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))

con <- dbConnect(noctua::athena())

dbWriteTable(con, "test_split1", df, compress = T, overwrite = T) # default will now split compressed file into 20 equal size files.
```

- Added information message to inform the user about what files have been added to the S3 location if the user is overwriting an Athena table.
Minor Change
- `copy_to` method now supports `compress` and `max_batch`, to align with `dbWriteTable`
Bug Fix
- Fixed bug in regards to Athena DDL being created incorrectly when passed from `dbWriteTable`
- Thanks to @OssiLehtinen for identifying an issue around uploading class `POSIXct` to Athena. This class was converted incorrectly and AWS Athena would return NA instead. `noctua` will now correctly convert `POSIXct` to timestamp and will also correctly read timestamp into `POSIXct`.
- Thanks to @OssiLehtinen for discovering an issue with `NA` in string format. Before, `noctua` would return `NA` in string class as `""`; this has now been fixed.
- When returning a single column data.frame from Athena, `noctua` would translate the output into a vector with the current method `dbFetch` `n = 0`.
- Thanks to @OssiLehtinen for identifying an issue around `sql_translate_env`. Previously `noctua` would take the default `dplyr::sql_translate_env`; now `noctua` has a custom method that uses data types from https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html
Unit tests
- `POSIXct` class has now been added to the data transfer unit test.
- `dplyr sql_translate_env`: tests if R functions are correctly translated into Athena `sql` syntax.
noctua 1.2.1
New Features:
- Parquet file type can now be compressed using snappy compression when writing data to S3.
Bug Fix
- Older versions of R returned errors when the function `dbWriteTable` was called. The bug was due to the function `sqlCreateTable`, which `dbWriteTable` calls. Parameters `table` and `fields` were set to `NULL`. This has now been fixed.
noctua 1.2.0
Minor Change
- `s3.location` parameter in `dbWriteTable` can now be made nullable
Backend Change
- Helper function `upload_data` has been rebuilt, removing the old "horrible" if statement with `paste`; now the function relies on `sprintf` to construct the S3 location path. This method is a lot clearer in how the S3 location is created, plus it enables `dbWriteTable` to be simplified.
- `dbWriteTable` can now upload data to the default s3_staging directory created in `dbConnect`; this simplifies `dbWriteTable` to:

```r
library(DBI)

con <- dbConnect(noctua::athena())

dbWriteTable(con, "iris", iris)
```

Bug Fix
- Info message wasn't being returned when colnames needed changing for the Athena DDL
Unit Tests
- `data transfer` test now tests compress and default s3.location when transferring data
New Feature
- GZIP compression is now supported for "csv" and "tsv" file formats in `dbWriteTable`
Minor Change
- `sqlCreateTable` info message will now only inform the user if colnames have changed, and display the column names that have changed
noctua 1.1.0
New Features
- Credentials are now passed through the new `config = list()` parameter in `paws` objects.
- `BigInt` are now passed correctly into `integer64`.
Bug fix
- `AthenaResult` returned: `Error in call[[2]] : object of type 'closure' is not subsettable`. The function `do.call` was causing the issue; to address this, `do.call` has been removed and the helper function `request` has been broken down into `ResultConfiguration` to return a single component of `start_query_execution`.
- All functions that utilise `do.call` have been broken down due to the error: `Error in call[[2]] : object of type 'closure' is not subsettable`
Unit Tests
- Added `bigint` to `integer64` in the data.transfer unit test
Minor Change
- Dependency `paws` version has been set to a minimum of `0.1.5` due to the latest change.
Major Change
- `data.table` is now used as the default file parser (`data.table::fread`/`data.table::fwrite`). This isn't a breaking change as `data.table` was used before; however, this change makes `data.table` the default file parser.
noctua 1.0.0
New Features
DBI
- `dbConnect` method can use the following methods:
  - assume role
  - aws profile name
  - hard coded aws credentials
  - set credentials in system variables
- Enabled method to upload parquet file format into AWS S3 using the `arrow` package
Athena lower level api
- `assume_role`: developed method for users to assume a role when connecting to AWS Athena
- Developed methods to create, list, delete and get AWS Athena work groups