fix-save-to-file-as-stream #489
xmlWorksheet.go
Outdated
@@ -43,11 +45,114 @@ type xlsxWorksheet struct { | |||
ExtLst *xlsxExtLst `xml:"extLst"` | |||
} | |||
|
|||
// MarshalXML implements xml.Marshaler |
Invalid attribute and incorrect namespace order in workbooks, worksheets, and style, which will cause the file to be corrupted when creating a new file
- These attributes and namespaces are replaced by this function call: replaceRelationshipsBytes(replaceWorkSheetsRelationshipsNameSpaceBytes(output))
in https://github.com/360EntSecGroup-Skylar/excelize/blob/master/sheet.go#L110
- I tested this code by creating a new file and then extracting it as a zip file. Both branches, mine and master, give the same results, which means the output files are identical.
also this is a better way to replace the function replaceRelationshipsBytes
@xuri pls re-review the PR or give me more comments
Thanks for your PR. There is a lot of code in this PR, and I maintain this project in my spare time, so I need some time to review it.
lib.go
Outdated
return encoder.Encode(data) | ||
} | ||
|
||
// writeStringToZipWriter writes string to zip,Writer |
Extra comma in the comment here
checking grammar in comment :p
I know it's a little trivial but clean comments help everyone who uses the code
comment updated
xmlWorkbook.go
Outdated
@@ -44,6 +44,70 @@ type xlsxWorkbook struct { | |||
FileRecoveryPr *xlsxFileRecoveryPr `xml:"fileRecoveryPr"` | |||
} | |||
|
|||
// MarshalXML implements xml.Marshaler | |||
func (x xlsxWorkbook) MarshalXML(e *xml.Encoder, start xml.StartElement) error { | |||
x2 := struct { |
Would it be too costly to add these fields used in the custom marshal functions to the normal structs we already have and then populate the extra fields during a write? It seems like this exposes us to potential bugs where we might add other fields to things like xlsxWorkbook and forget to add them to these structs here and be very confused as to why they aren't appearing in the final output.
Since this is primarily a performance PR, it would be helpful if you included some benchmarks in the test files to show the actual difference and prevent future regressions as changes are made. |
Codecov Report
@@ Coverage Diff @@
## master #489 +/- ##
==========================================
- Coverage 97.1% 96.39% -0.72%
==========================================
Files 28 29 +1
Lines 6078 6128 +50
==========================================
+ Hits 5902 5907 +5
- Misses 93 117 +24
- Partials 83 104 +21
|
I created a benchmark here and ran it against master and this branch.
Master
PR
I also ran it against my branch here that I brought up in #494, which has a narrower scope of changes.
BenchmarkWrite-8 2 517713714 ns/op 127115252 B/op 1833746 allocs/op
We can probably combine both changes, but there is clearly some benefit to targeting the most allocation-heavy functions when going after memory issues. |
Edit: Disregard all of this comment, I misunderstood what was possible with the zip library! Would we achieve even more memory gains by replacing the File.XLSX map of The struct could provide a similar interface to a map making the updates easier throughout the library. That should avoid the current issue of essentially storing an additional copy of the file in memory. We could even provide some configuration options to flush the file contents to a temp file on disk while we are working with the file so we don't have to hold that in memory the whole time either. |
@mlh758 Currently, excelize stores the data in both the map[string][]byte and the zip.Writer. zip.Writer is part of the standard library and already optimized (I think).
As a result, this PR can save about 15% of memory. |
You can use Go's built in tools for profiling and memory statistics:
That runs the benchmark I posted in my comment above and profiles the memory usage. The pprof command trims out smaller functions to clean up the visual and creates an image showing consumption. Here is what your PR looks like, if you're curious:
What I was getting at in my comment is that most of the memory consumed is in smaller operations repeated many times that perform excessive allocations. The copying at write is definitely a good target for optimization, and I like the idea of getting rid of the
Could you add the xml fields to the base structs we already have, and change the fields before we write instead? That would keep us from having to copy, and also likely keep us from needing custom Marshal functions, which I suspect are the big culprit of the slowdown. Custom marshal functions can have some surprising side effects. |
Also I've been tinkering with trying to use just a zip archive for storage of serialized files for the lifetime of the excel object but I'm running into a lot of issues with modifying the archive. You can write the same file multiple times into a zip archive and removing a file means you have to copy the whole archive except for that one file. Checking for the existence of a file being added first (and avoiding the copy most of the time a file is written to the archive) leads to some solid performance gains but it is also leading to some subtle bugs where the write buffer is not a valid zip archive and can't be read. If I could figure out the write buffer issue it would be nice to combine it with the work you're doing in this PR to pass everything around as streams all the time. |
@mlh758 : Recently, I found the way to pass any default xml to a struct like |
What is the status of this please? We have quite an interest in specifically reducing the maximum total memory consumed when saving a file. We would prefer overall memory saving over speed gains: our app runs on a server, where, for us, the tasks are not time critical but overall memory usage is. |
Hi @jimsmart, I have added a stream writer for generating a new worksheet with huge amounts of data. This PR contains a lot of code and I need some time to review. |
Ok. Thanks for the info. |
@xuri: There are other ways to improve performance, such as changing string fields to Stringer fields
159d8f2
to
c3e92a5
PR Details
Description
Related Issue
#487
Motivation and Context
improve performance and reduce memory for storing XML data
How Has This Been Tested
Types of changes
Checklist