Encoding (#207)
* Enable reading xlsx files with umlauts in a latin1 Windows R 4.1.2 environment. This needs further cleanup and additional fixes for writing.
Currently this also breaks reading and writing on UTF-8 systems. An additional option is needed to decide when to set which pugi::encoding.

* loading a file with umlauts and saving and loading and saving works fine

* write_xml function to write xml files

* fix reading encoding with inlineStr

* switch encoding to latin1 if l10n_info() does not return UTF-8 = TRUE

* fix tests on Windows

* no one needs this. they are called either in utf8 or latin1 encoding, no conversion is required

* convert to XPtrXML

* more XPtrXML

* cleanups
* another stringi function
* switch back to print(" "); unintended development mixup
* remove commented code

* remove write_xml, we only use write_xmlPtr

* update NEWS entry
JanMarvin authored May 29, 2022
1 parent 2a457f4 commit 587b96e
Showing 19 changed files with 217 additions and 188 deletions.
2 changes: 1 addition & 1 deletion NAMESPACE
@@ -139,11 +139,11 @@ importFrom(grDevices,rgb)
importFrom(grDevices,tiff)
importFrom(magrittr,"%>%")
importFrom(stringi,stri_c)
importFrom(stringi,stri_conv)
importFrom(stringi,stri_isempty)
importFrom(stringi,stri_join)
importFrom(stringi,stri_match)
importFrom(stringi,stri_match_all_regex)
importFrom(stringi,stri_read_lines)
importFrom(stringi,stri_replace_all_fixed)
importFrom(stringi,stri_split_fixed)
importFrom(stringi,stri_split_regex)
2 changes: 1 addition & 1 deletion NEWS.md
@@ -6,7 +6,7 @@

* Fix an issue with broken xml in Excels vml files and enable opening xlsm files with `wb$open()` [202](https://github.com/JanMarvin/openxlsx2/pull/202)

* Fix reading and writing on non UTF-8 systems [198](https://github.com/JanMarvin/openxlsx2/pull/198) [199](https://github.com/JanMarvin/openxlsx2/pull/199)
* Fix reading and writing on non UTF-8 systems [198](https://github.com/JanMarvin/openxlsx2/pull/198) [199](https://github.com/JanMarvin/openxlsx2/pull/199) [207](https://github.com/JanMarvin/openxlsx2/pull/207)

* Instruct parser to import nodes with whitespaces. This fixes a complaint in spreadsheet software. [189](https://github.com/JanMarvin/openxlsx2/pull/189)

20 changes: 12 additions & 8 deletions R/RcppExports.R
@@ -49,12 +49,12 @@ loadvals <- function(sheet_data, doc) {
invisible(.Call(`_openxlsx2_loadvals`, sheet_data, doc))
}

readXMLPtr <- function(path, isfile, escapes, declaration) {
.Call(`_openxlsx2_readXMLPtr`, path, isfile, escapes, declaration)
readXMLPtr <- function(path, isfile, escapes, declaration, utf8) {
.Call(`_openxlsx2_readXMLPtr`, path, isfile, escapes, declaration, utf8)
}

readXML <- function(path, isfile, escapes, declaration) {
.Call(`_openxlsx2_readXML`, path, isfile, escapes, declaration)
readXML <- function(path, isfile, escapes, declaration, utf8) {
.Call(`_openxlsx2_readXML`, path, isfile, escapes, declaration, utf8)
}

getXMLXPtrName1 <- function(doc) {
@@ -129,8 +129,8 @@ printXPtr <- function(doc, no_escapes, raw) {
.Call(`_openxlsx2_printXPtr`, doc, no_escapes, raw)
}

write_xml_file <- function(xml_content, fl, escapes) {
invisible(.Call(`_openxlsx2_write_xml_file`, xml_content, fl, escapes))
write_xml_file <- function(xml_content, escapes) {
.Call(`_openxlsx2_write_xml_file`, xml_content, escapes)
}

#' adds or updates attribute(s) in existing xml node
@@ -305,7 +305,11 @@ set_sst <- function(sharedStrings) {
.Call(`_openxlsx2_set_sst`, sharedStrings)
}

write_worksheet <- function(prior, post, sheet_data, cols_attr, R_fileName, is_utf8) {
invisible(.Call(`_openxlsx2_write_worksheet`, prior, post, sheet_data, cols_attr, R_fileName, is_utf8))
write_worksheet <- function(prior, post, sheet_data) {
.Call(`_openxlsx2_write_worksheet`, prior, post, sheet_data)
}

write_xmlPtr <- function(doc, fl) {
invisible(.Call(`_openxlsx2_write_xmlPtr`, doc, fl))
}

1 change: 0 additions & 1 deletion R/class-hyperlink.R
@@ -99,7 +99,6 @@ xml_to_hyperlink <- function(xml) {
names <- lapply(a, function(i) regmatches(i, regexpr('[a-zA-Z]+(?=\\=".*?")', i, perl = TRUE)))
vals <- lapply(a, function(i) {
res <- regmatches(i, regexpr('(?<=").*?(?=")', i, perl = TRUE))
Encoding(res) <- "UTF-8"
res
})

10 changes: 4 additions & 6 deletions R/class-workbook.R
@@ -1436,7 +1436,6 @@ wbWorkbook <- R6::R6Class(
"fileRecoveryPr", "webPublishObjects", "extLst"
)


write_file(
head = '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x15 xr xr6 xr10 xr2" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">',
body = pxml(workbookXML[workbook_openxml281]),
@@ -4373,14 +4372,13 @@ wbWorkbook <- R6::R6Class(
# # restore order
# ws$sheet_data$row_attr <- row_attr[wanted]

write_worksheet(
# create entire sheet prior to writing it
sheet_xml <- write_worksheet(
prior = prior,
post = post,
sheet_data = ws$sheet_data,
cols_attr = ws$cols_attr,
R_fileName = file.path(xlworksheetsDir, sprintf("sheet%s.xml", i)),
is_utf8 = l10n_info()[["UTF-8"]]
sheet_data = ws$sheet_data
)
write_xmlPtr(doc = sheet_xml, fl = file.path(xlworksheetsDir, sprintf("sheet%s.xml", i)))

## write worksheet rels
if (length(self$worksheets_rels[[i]])) {
3 changes: 1 addition & 2 deletions R/class-worksheet.R
@@ -252,6 +252,7 @@ wbWorksheet <- R6::R6Class(
if (length(self$cols_attr)) {
paste(c("<cols>", self$cols_attr, "</cols>"), collapse = "")
},
'</worksheet>',
sep = ""
)
},
@@ -374,8 +375,6 @@ wbWorksheet <- R6::R6Class(
)
},

"</worksheet>",

# end
sep = ""
)
6 changes: 0 additions & 6 deletions R/helperFunctions.R
Expand Up @@ -213,14 +213,8 @@ illegalcharsreplace <- c("&amp;", "&quot;", "&apos;", "&lt;", "&gt;", "", "", ""
#' @keywords internal
#' @noRd
replaceIllegalCharacters <- function(v) {
vEnc <- Encoding(v)
v <- as.character(v)

flg <- vEnc != "UTF-8"
if (any(flg)) {
v[flg] <- stri_conv(v[flg], from = "", to = "UTF-8")
}

v <- stri_replace_all_fixed(v, illegalchars, illegalcharsreplace, vectorize_all = FALSE)

return(v)
2 changes: 1 addition & 1 deletion R/openxlsx2.R
@@ -11,7 +11,7 @@
#' @import R6
#' @importFrom grDevices bmp col2rgb colours dev.copy dev.list dev.off jpeg png rgb tiff
#' @importFrom magrittr %>%
#' @importFrom stringi stri_c stri_conv stri_isempty stri_join stri_match stri_match_all_regex stri_replace_all_fixed stri_split_fixed stri_split_regex stri_sub
#' @importFrom stringi stri_c stri_isempty stri_join stri_match stri_match_all_regex stri_read_lines stri_replace_all_fixed stri_split_fixed stri_split_regex stri_sub
#' @importFrom utils download.file head menu read.csv unzip
#' @importFrom zip zip
#'
17 changes: 11 additions & 6 deletions R/pugixml.R
@@ -46,15 +46,19 @@ read_xml <- function(xml, pointer = TRUE, escapes = FALSE, declaration = FALSE)
isfile <- TRUE

if (!isfile)
xml <- paste0(xml, collapse = "")
xml <- stringi::stri_join(xml, collapse = "")

if (identical(xml, ""))
xml <- "<NA_character_ />"


# for non utf8 systems, read xml input files as utf8
utf8 <- isTRUE(l10n_info()[["UTF-8"]])

if (pointer) {
z <- readXMLPtr(xml, isfile, escapes, declaration)
z <- readXMLPtr(xml, isfile, escapes, declaration, utf8)
} else {
z <- readXML(xml, isfile, escapes, declaration)
z <- readXML(xml, isfile, escapes, declaration, utf8)
}

z
@@ -248,8 +252,9 @@ as_xml <- function(x) {
#' @export
# TODO needs a unit test
write_file <- function(head = "", body = "", tail = "", fl = "", escapes = FALSE) {
xml_content <- paste0(head, body, tail, collapse = "")
write_xml_file(xml_content = xml_content, fl = fl, escapes = escapes)
xml_content <- stringi::stri_join(head, body, tail, collapse = "")
xml_content <- write_xml_file(xml_content = xml_content, escapes = escapes)
write_xmlPtr(xml_content, fl)
}

#' append xml child to node
@@ -292,7 +297,7 @@ xml_add_child <- function(xml_node, xml_child, level, pointer = FALSE, escapes =

if (length(level) == 2)
z <- xml_append_child3(xml_node, xml_child, level[[1]], level[[2]], pointer, escapes)

}

return(z)
2 changes: 1 addition & 1 deletion R/wb_load.R
@@ -916,7 +916,7 @@ wb_load <- function(file, xlsxFile = NULL, sheet, data_only = FALSE) {
ind <- grepl(target, vmlDrawingXML)

if (any(ind)) {
vml <- paste(readLines(vmlDrawingXML[ind], encoding = "UTF-8", warn = FALSE), sep = "", collapse = "")
vml <- paste(stringi::stri_read_lines(vmlDrawingXML[ind], encoding = "UTF-8"), sep = "", collapse = "")
wb$vml[[i]] <- read_xml(gsub("<br>", "<br/>", vml), pointer = FALSE)

relsInd <- grepl(target, vmlDrawingRelsXML)
54 changes: 33 additions & 21 deletions src/RcppExports.cpp
@@ -153,30 +153,32 @@ BEGIN_RCPP
END_RCPP
}
// readXMLPtr
SEXP readXMLPtr(std::string path, bool isfile, bool escapes, bool declaration);
RcppExport SEXP _openxlsx2_readXMLPtr(SEXP pathSEXP, SEXP isfileSEXP, SEXP escapesSEXP, SEXP declarationSEXP) {
SEXP readXMLPtr(std::string path, bool isfile, bool escapes, bool declaration, bool utf8);
RcppExport SEXP _openxlsx2_readXMLPtr(SEXP pathSEXP, SEXP isfileSEXP, SEXP escapesSEXP, SEXP declarationSEXP, SEXP utf8SEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type path(pathSEXP);
Rcpp::traits::input_parameter< bool >::type isfile(isfileSEXP);
Rcpp::traits::input_parameter< bool >::type escapes(escapesSEXP);
Rcpp::traits::input_parameter< bool >::type declaration(declarationSEXP);
rcpp_result_gen = Rcpp::wrap(readXMLPtr(path, isfile, escapes, declaration));
Rcpp::traits::input_parameter< bool >::type utf8(utf8SEXP);
rcpp_result_gen = Rcpp::wrap(readXMLPtr(path, isfile, escapes, declaration, utf8));
return rcpp_result_gen;
END_RCPP
}
// readXML
SEXP readXML(std::string path, bool isfile, bool escapes, bool declaration);
RcppExport SEXP _openxlsx2_readXML(SEXP pathSEXP, SEXP isfileSEXP, SEXP escapesSEXP, SEXP declarationSEXP) {
SEXP readXML(std::string path, bool isfile, bool escapes, bool declaration, bool utf8);
RcppExport SEXP _openxlsx2_readXML(SEXP pathSEXP, SEXP isfileSEXP, SEXP escapesSEXP, SEXP declarationSEXP, SEXP utf8SEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type path(pathSEXP);
Rcpp::traits::input_parameter< bool >::type isfile(isfileSEXP);
Rcpp::traits::input_parameter< bool >::type escapes(escapesSEXP);
Rcpp::traits::input_parameter< bool >::type declaration(declarationSEXP);
rcpp_result_gen = Rcpp::wrap(readXML(path, isfile, escapes, declaration));
Rcpp::traits::input_parameter< bool >::type utf8(utf8SEXP);
rcpp_result_gen = Rcpp::wrap(readXML(path, isfile, escapes, declaration, utf8));
return rcpp_result_gen;
END_RCPP
}
@@ -416,15 +418,15 @@ BEGIN_RCPP
END_RCPP
}
// write_xml_file
void write_xml_file(std::string xml_content, std::string fl, bool escapes);
RcppExport SEXP _openxlsx2_write_xml_file(SEXP xml_contentSEXP, SEXP flSEXP, SEXP escapesSEXP) {
XPtrXML write_xml_file(std::string xml_content, bool escapes);
RcppExport SEXP _openxlsx2_write_xml_file(SEXP xml_contentSEXP, SEXP escapesSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type xml_content(xml_contentSEXP);
Rcpp::traits::input_parameter< std::string >::type fl(flSEXP);
Rcpp::traits::input_parameter< bool >::type escapes(escapesSEXP);
write_xml_file(xml_content, fl, escapes);
return R_NilValue;
rcpp_result_gen = Rcpp::wrap(write_xml_file(xml_content, escapes));
return rcpp_result_gen;
END_RCPP
}
// xml_attr_mod
@@ -759,17 +761,26 @@ BEGIN_RCPP
END_RCPP
}
// write_worksheet
void write_worksheet(std::string prior, std::string post, Rcpp::Environment sheet_data, Rcpp::CharacterVector cols_attr, std::string R_fileName, bool is_utf8);
RcppExport SEXP _openxlsx2_write_worksheet(SEXP priorSEXP, SEXP postSEXP, SEXP sheet_dataSEXP, SEXP cols_attrSEXP, SEXP R_fileNameSEXP, SEXP is_utf8SEXP) {
XPtrXML write_worksheet(std::string prior, std::string post, Rcpp::Environment sheet_data);
RcppExport SEXP _openxlsx2_write_worksheet(SEXP priorSEXP, SEXP postSEXP, SEXP sheet_dataSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type prior(priorSEXP);
Rcpp::traits::input_parameter< std::string >::type post(postSEXP);
Rcpp::traits::input_parameter< Rcpp::Environment >::type sheet_data(sheet_dataSEXP);
Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type cols_attr(cols_attrSEXP);
Rcpp::traits::input_parameter< std::string >::type R_fileName(R_fileNameSEXP);
Rcpp::traits::input_parameter< bool >::type is_utf8(is_utf8SEXP);
write_worksheet(prior, post, sheet_data, cols_attr, R_fileName, is_utf8);
rcpp_result_gen = Rcpp::wrap(write_worksheet(prior, post, sheet_data));
return rcpp_result_gen;
END_RCPP
}
// write_xmlPtr
void write_xmlPtr(XPtrXML doc, std::string fl);
RcppExport SEXP _openxlsx2_write_xmlPtr(SEXP docSEXP, SEXP flSEXP) {
BEGIN_RCPP
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< XPtrXML >::type doc(docSEXP);
Rcpp::traits::input_parameter< std::string >::type fl(flSEXP);
write_xmlPtr(doc, fl);
return R_NilValue;
END_RCPP
}
@@ -787,8 +798,8 @@ static const R_CallMethodDef CallEntries[] = {
{"_openxlsx2_col_to_df", (DL_FUNC) &_openxlsx2_col_to_df, 1},
{"_openxlsx2_df_to_xml", (DL_FUNC) &_openxlsx2_df_to_xml, 2},
{"_openxlsx2_loadvals", (DL_FUNC) &_openxlsx2_loadvals, 2},
{"_openxlsx2_readXMLPtr", (DL_FUNC) &_openxlsx2_readXMLPtr, 4},
{"_openxlsx2_readXML", (DL_FUNC) &_openxlsx2_readXML, 4},
{"_openxlsx2_readXMLPtr", (DL_FUNC) &_openxlsx2_readXMLPtr, 5},
{"_openxlsx2_readXML", (DL_FUNC) &_openxlsx2_readXML, 5},
{"_openxlsx2_getXMLXPtrName1", (DL_FUNC) &_openxlsx2_getXMLXPtrName1, 1},
{"_openxlsx2_getXMLXPtrName2", (DL_FUNC) &_openxlsx2_getXMLXPtrName2, 2},
{"_openxlsx2_getXMLXPtrName3", (DL_FUNC) &_openxlsx2_getXMLXPtrName3, 3},
@@ -807,7 +818,7 @@ static const R_CallMethodDef CallEntries[] = {
{"_openxlsx2_getXMLXPtr3attr", (DL_FUNC) &_openxlsx2_getXMLXPtr3attr, 4},
{"_openxlsx2_getXMLXPtr4attr", (DL_FUNC) &_openxlsx2_getXMLXPtr4attr, 5},
{"_openxlsx2_printXPtr", (DL_FUNC) &_openxlsx2_printXPtr, 3},
{"_openxlsx2_write_xml_file", (DL_FUNC) &_openxlsx2_write_xml_file, 3},
{"_openxlsx2_write_xml_file", (DL_FUNC) &_openxlsx2_write_xml_file, 2},
{"_openxlsx2_xml_attr_mod", (DL_FUNC) &_openxlsx2_xml_attr_mod, 4},
{"_openxlsx2_xml_node_create", (DL_FUNC) &_openxlsx2_xml_node_create, 5},
{"_openxlsx2_xml_append_child1", (DL_FUNC) &_openxlsx2_xml_append_child1, 4},
@@ -836,7 +847,8 @@ static const R_CallMethodDef CallEntries[] = {
{"_openxlsx2_read_colors", (DL_FUNC) &_openxlsx2_read_colors, 1},
{"_openxlsx2_write_colors", (DL_FUNC) &_openxlsx2_write_colors, 1},
{"_openxlsx2_set_sst", (DL_FUNC) &_openxlsx2_set_sst, 1},
{"_openxlsx2_write_worksheet", (DL_FUNC) &_openxlsx2_write_worksheet, 6},
{"_openxlsx2_write_worksheet", (DL_FUNC) &_openxlsx2_write_worksheet, 3},
{"_openxlsx2_write_xmlPtr", (DL_FUNC) &_openxlsx2_write_xmlPtr, 2},
{NULL, NULL, 0}
};

4 changes: 2 additions & 2 deletions src/helper_functions.cpp
@@ -97,7 +97,7 @@ std::string int_to_col(uint32_t cell) {

// driver function for col_to_int
uint32_t uint_col_to_int(std::string& a) {

char A = 'A';
int aVal = (int)A - 1;
int sum = 0;
@@ -186,7 +186,7 @@ SEXP rbindlist(Rcpp::List x) {
auto find_res = unique_names.find(names[j]);
auto mtc = std::distance(unique_names.begin(), find_res);

Rcpp::as<Rcpp::CharacterVector>(df[mtc])[i] = values[j];
Rcpp::as<Rcpp::CharacterVector>(df[mtc])[i] = Rcpp::String(values[j]);
}

}
3 changes: 2 additions & 1 deletion src/load_workbook.cpp
@@ -184,6 +184,7 @@ Rcpp::DataFrame row_to_df(XPtrXML doc) {
void loadvals(Rcpp::Environment sheet_data, XPtrXML doc) {

auto ws = doc->child("worksheet").child("sheetData");
bool utf8 = Rcpp::as<bool>(doc.attr("is_utf8"));

// character
Rcpp::DataFrame row_attributes;
@@ -289,7 +290,7 @@ void loadvals(Rcpp::Environment sheet_data, XPtrXML doc) {
// <is>
if (val_name == is_str) {
std::ostringstream oss;
val.print(oss, " ", pugi::format_raw);
val.print(oss, " ", pugi::format_raw, is_utf8(utf8));
single_xml_col.is = oss.str();
} // </is>

18 changes: 7 additions & 11 deletions src/openxlsx2.h
@@ -9,15 +9,11 @@ SEXP is_to_txt(Rcpp::CharacterVector is_vec);

std::string txt_to_is(std::string txt, bool no_escapes, bool raw);


template <typename T>
inline T Riconv(T &mystring) {
Rcpp::Environment base("package:base");
Rcpp::Function iconv = base["iconv"];

mystring = Rcpp::as<T>(
iconv(mystring, Rcpp::Named("from", ""), Rcpp::Named("to","UTF-8"))
);

return(mystring);
// check if we are running in a latin1 or utf8 encoding
inline pugi::xml_encoding is_utf8(bool utf8) {
if (utf8) {
return(pugi::encoding_utf8);
} else {
return(pugi::encoding_latin1);
}
}
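
The `is_utf8()` helper above only selects between `pugi::encoding_utf8` and `pugi::encoding_latin1`. To illustrate what that choice means for a byte such as `0xE4` (the latin1 encoding of "ä"), here is a minimal standalone sketch of converting between the two encodings. It does not use pugixml or Rcpp, and the function names `latin1_to_utf8` and `utf8_to_latin1` are hypothetical; they are not part of openxlsx2 and are not claimed to mirror pugixml's internal implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of openxlsx2 or pugixml):
// re-encode ISO-8859-1 (latin1) bytes as UTF-8. Every latin1 byte maps to
// the Unicode code point of the same value, so bytes >= 0x80 become a
// two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
  std::string out;
  out.reserve(in.size() * 2);
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// Hypothetical helper for the reverse direction. Code points above U+00FF
// have no latin1 representation; this sketch replaces them with '?'.
std::string utf8_to_latin1(const std::string& in) {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80) {
      // plain ASCII is identical in both encodings
      out.push_back(static_cast<char>(c));
      i += 1;
    } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size() &&
               (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
      // two-byte sequence encoding U+0080..U+00FF
      unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
      out.push_back(static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F)));
      i += 2;
    } else {
      // not representable in latin1: emit '?' and skip continuation bytes
      out.push_back('?');
      i += 1;
      while (i < in.size() &&
             (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
        ++i;
      }
    }
  }
  return out;
}
```

With these helpers, `latin1_to_utf8("\xE4")` yields the two bytes `0xC3 0xA4`, and `utf8_to_latin1` maps them back to `0xE4`. The round trip is lossless only for text that fits into latin1, which is presumably why the encoding is chosen from the session locale rather than fixed.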