
fix-save-to-file-as-stream #489

Closed
wants to merge 7 commits

Conversation

ducquangkstn
Contributor

PR Details

  • Change the write method from writing to a bytes.Buffer to streaming directly to an io.Writer

Description

  • Create a zip.Writer from the io.Writer
  • For each component, create a new file in the archive, then use an xml.Encoder to write it to the io.Writer (see the hedged sketch after the snippet below)
  • Use a custom MarshalXML to avoid this block of code:

func replaceRelationshipsBytes(content []byte) []byte {
	oldXmlns := []byte(`xmlns:relationships="http://schemas.openxmlformats.org/officeDocument/2006/relationships" relationships`)
	newXmlns := []byte("r")
	return bytes.Replace(content, oldXmlns, newXmlns, -1)
}
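
For reference, a minimal sketch of the streaming flow described above, with placeholder names (writeComponent, the workbook struct) rather than the actual code in this PR:

package main

import (
	"archive/zip"
	"encoding/xml"
	"io"
	"os"
)

// writeComponent streams one XML part straight into its zip entry, so no
// intermediate bytes.Buffer ever holds the whole serialized document.
func writeComponent(zw *zip.Writer, name string, v interface{}) error {
	w, err := zw.Create(name) // new file entry inside the archive
	if err != nil {
		return err
	}
	if _, err := io.WriteString(w, xml.Header); err != nil {
		return err
	}
	return xml.NewEncoder(w).Encode(v)
}

func main() {
	out, err := os.Create("test.xlsx")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	zw := zip.NewWriter(out) // zip.Writer built on top of any io.Writer
	defer zw.Close()

	// Placeholder component; the real code would encode xlsxWorkbook,
	// xlsxWorksheet, xlsxStyleSheet, and so on.
	type workbook struct {
		XMLName xml.Name `xml:"workbook"`
	}
	if err := writeComponent(zw, "xl/workbook.xml", &workbook{}); err != nil {
		panic(err)
	}
}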

Related Issue

#487

Motivation and Context

Improve performance and reduce the memory used for storing XML data.

How Has This Been Tested

  • Wrote a unit test comparing the output of the custom marshal function with the bytes.Replace result

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@xuri xuri self-requested a review September 27, 2019 10:48
xmlWorksheet.go Outdated
@@ -43,11 +45,114 @@ type xlsxWorksheet struct {
ExtLst *xlsxExtLst `xml:"extLst"`
}

// MarshalXML implements xml.Marshaler
Member

Invalid attributes and incorrect namespace order in the workbook, worksheets, and styles, which will cause the file to be corrupted when creating a new file

Contributor Author

  • These attributes and namespaces are the ones replaced by replaceRelationshipsBytes(replaceWorkSheetsRelationshipsNameSpaceBytes(output))
    in https://github.com/360EntSecGroup-Skylar/excelize/blob/master/sheet.go#L110
  • I tested this by creating a new file and extracting it as a zip archive. Both branches, mine and master, give the same results, which means the output files are identical

Contributor Author

Also, this is a better way to replace the replaceRelationshipsBytes function.

Contributor Author

@xuri please re-review the PR or give me more comments

Member

@xuri xuri Oct 18, 2019

Thanks for your PR. There is a lot of code in this PR; I maintain this project in my spare time, so I will need some time to review it.

file.go (resolved)
xmlStyles.go (resolved)
xmlStyles_test.go (resolved)
lib.go Outdated
return encoder.Encode(data)
}

// writeStringToZipWriter writes string to zip,Writer
Contributor

Extra comma in the comment here


checking grammar in comment :p

Contributor

I know it's a little trivial, but clean comments help everyone who uses the code.

Contributor Author

comment updated

file.go (resolved)
xmlWorkbook.go Outdated
@@ -44,6 +44,70 @@ type xlsxWorkbook struct {
FileRecoveryPr *xlsxFileRecoveryPr `xml:"fileRecoveryPr"`
}

// MarshalXML implements xml.Marshaler
func (x xlsxWorkbook) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
x2 := struct {
Contributor

Would it be too costly to add these fields used in the custom marshal functions to the normal structs we already have and then populate the extra fields during a write? It seems like this exposes us to potential bugs where we might add other fields to things like xlsxWorkbook and forget to add them to these structs here and be very confused as to why they aren't appearing in the final output.
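
For illustration, a hedged sketch of the pattern being discussed (names and fields are placeholders, not the actual PR code): the custom MarshalXML copies the struct's data into an anonymous struct that carries the extra namespace attribute, so any field later added to the base struct but not mirrored there would silently drop out of the output.

package main

import (
	"encoding/xml"
	"os"
)

// Placeholder definition, not the real excelize struct.
type xlsxWorkbook struct {
	Sheets []string `xml:"sheets>sheet"`
}

// MarshalXML mirrors every field into an anonymous struct so the namespace
// attributes can be emitted; forgetting to mirror a field loses it on write.
func (x xlsxWorkbook) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	x2 := struct {
		XMLName xml.Name `xml:"http://schemas.openxmlformats.org/spreadsheetml/2006/main workbook"`
		XmlnsR  string   `xml:"xmlns:r,attr"`
		Sheets  []string `xml:"sheets>sheet"`
		// ...every other xlsxWorkbook field would have to be repeated here...
	}{
		XmlnsR: "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
		Sheets: x.Sheets,
	}
	return e.Encode(x2)
}

func main() {
	_ = xml.NewEncoder(os.Stdout).Encode(xlsxWorkbook{Sheets: []string{"Sheet1"}})
}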

xmlWorkbook.go (resolved)
xmlWorkbook.go (resolved)
@mlh758
Contributor

mlh758 commented Oct 24, 2019

Since this is primarily a performance PR, it would be helpful if you included some benchmarks in the test files to show the actual difference and prevent future regressions as changes are made.

@codecov-io

codecov-io commented Oct 24, 2019

Codecov Report

Merging #489 into master will decrease coverage by 0.71%.
The diff coverage is 61.48%.


@@            Coverage Diff             @@
##           master     #489      +/-   ##
==========================================
- Coverage    97.1%   96.39%   -0.72%     
==========================================
  Files          28       29       +1     
  Lines        6078     6128      +50     
==========================================
+ Hits         5902     5907       +5     
- Misses         93      117      +24     
- Partials       83      104      +21
Impacted Files Coverage Δ
xmlWorksheet.go 100% <ø> (ø) ⬆️
excelize.go 95.5% <100%> (ø) ⬆️
calcchain.go 100% <100%> (ø) ⬆️
pivotTable.go 92.8% <100%> (ø) ⬆️
styles.go 98.58% <100%> (ø) ⬆️
stream.go 86.95% <100%> (+0.04%) ⬆️
xmlUtils.go 100% <100%> (ø)
cell.go 95.94% <100%> (ø) ⬆️
table.go 93.49% <100%> (ø) ⬆️
file.go 62.33% <22.22%> (-25.73%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ca7231...7848184.

@ducquangkstn ducquangkstn changed the title fix-save-to-file-as-stream [WIP] fix-save-to-file-as-stream Oct 24, 2019
@mlh758
Contributor

mlh758 commented Oct 29, 2019

I created a benchmark here and ran it against master and this branch.

Master
BenchmarkWrite-8 2 574756323 ns/op 207278252 B/op 1893776 allocs/op

PR
BenchmarkWrite-8 2 548271059 ns/op 164345284 B/op 1893817 allocs/op

I also ran it against my branch here that I brought up in #494 which has a narrower scope of changes.

BenchmarkWrite-8 2 517713714 ns/op 127115252 B/op 1833746 allocs/op

We can probably combine both changes but there is clearly some benefit to targeting the most allocation heavy functions when going after memory issues.
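
The benchmark itself is only linked above; a write benchmark along these lines might look like the following sketch (the module path, sheet size, and cell values are assumptions, not the exact benchmark behind the numbers):

package excelize_test

import (
	"fmt"
	"io/ioutil"
	"testing"

	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

// BenchmarkWrite fills a sheet with many numeric cells and serializes it,
// so allocations on the save path show up in -benchmem output.
func BenchmarkWrite(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f := excelize.NewFile()
		for row := 1; row <= 10000; row++ {
			for col := 'A'; col <= 'J'; col++ {
				cell := fmt.Sprintf("%c%d", col, row)
				if err := f.SetCellValue("Sheet1", cell, row); err != nil {
					b.Fatal(err)
				}
			}
		}
		// Write to a throwaway sink so disk I/O stays out of the numbers.
		if err := f.Write(ioutil.Discard); err != nil {
			b.Fatal(err)
		}
	}
}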

@mlh758
Contributor

mlh758 commented Oct 29, 2019

Edit: Disregard all of this comment, I misunderstood what was possible with the zip library!

Would we achieve even more memory gains by replacing the File.XLSX map of map[string][]byte with a struct that wraps a zip file? Since it implements logic for reading and writing we could pass files in and out of it and let the zip object inside re-compress them when we're done with the files.

The struct could provide a similar interface to a map making the updates easier throughout the library. That should avoid the current issue of essentially storing an additional copy of the file in memory. saveFileList seems like the ideal place to provide write access into the zip file.

We could even provide some configuration options to flush the file contents to a temp file on disk while we are working with the file so we don't have to hold that in memory the whole time either.

@ducquangkstn
Contributor Author

@mlh758 Currently, excelize stores data in both the map[string][]byte and the zip.Writer; zip.Writer is part of the standard library and (I think) already well optimized.
I have tested the memory usage using this code:

package excelize_test

import (
	"fmt"
	"runtime"
	"testing"
	"time"

	// module path assumed to be the v2 path referenced elsewhere in this thread
	"github.com/360EntSecGroup-Skylar/excelize/v2"
	"github.com/stretchr/testify/require"
)

func TestFile_Save(t *testing.T) {
	//t.Skip()
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-done:
				return
			// time.After instead of time.Tick: calling time.Tick inside the
			// loop would leak a new Ticker on every iteration
			case <-time.After(time.Millisecond * 500):
				PrintMemUsage()
			}
		}
	}()

	f := excelize.NewFile()
	for s := 0; s < 5; s++ {
		sheetName := fmt.Sprint("sheet", s+1)
		f.NewSheet(sheetName)
		for _, col := range []string{"A", "B", "C", "D"} {
			for i := 0; i < 100000; i++ {
				require.NoError(t, f.SetCellValue(sheetName, fmt.Sprint(col, i+1), i))
			}
		}
	}
	require.NoError(t, f.SaveAs("test.xlsx"))
	PrintMemUsage()

	done <- struct{}{}
}

// PrintMemUsage outputs the current, total and OS memory being used, as well as
// the number of garbage collection cycles completed.
func PrintMemUsage() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// For info on each field, see: https://golang.org/pkg/runtime/#MemStats
	fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
	fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
	fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
	fmt.Printf("\tNumGC = %v\n", m.NumGC)
}

func bToMb(b uint64) uint64 {
	return b / 1024 / 1024
}

The result is that this PR saves about 15% of memory.

@mlh758
Contributor

mlh758 commented Oct 31, 2019

You can use Go's built-in tools for profiling and memory statistics:

go test -benchmem -run='github.com/360EntSecGroup-Skylar/excelize/v2' -bench BenchmarkWrite -memprofile memprofile.out

go tool pprof -nodefraction=0.1 -png memprofile.out > prchange.png

That runs the benchmark I posted in my comment above and profiles the memory usage. The pprof command trims out smaller functions to clean up the visual and creates an image showing consumption. Here is what your PR looks like, if you're curious:

[memory profile graph generated from memprofile.out: prchange.png]

What I was getting at in my comment is that most of the memory consumed is in smaller operations repeated many times that perform excessive allocations. The copying at write time is definitely a good target for optimization, and I like the idea of getting rid of the replaceRelationshipsNameSpaceBytes-type functions, because that will let us stream files more effectively. But when you're going after memory issues, it's a good idea to profile first and then go after the low-hanging fruit. As it stands right now, this PR reduces total memory consumed but increases allocations and CPU time. I suspect some of that is due to the anonymous structs and having to copy data across to them.

Could you add the XML fields to the base structs we already have, and change the fields before we write instead? That would keep us from having to copy, and would also likely keep us from needing custom Marshal functions, which I suspect are the big culprit of the slowdown. Custom marshal functions can have some surprising side effects.
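
A minimal sketch of that suggestion, assuming illustrative names rather than the real excelize definitions: keep the namespace attribute as an ordinary field on the base struct and populate it just before encoding, so nothing is copied into an anonymous struct and no custom MarshalXML is needed.

package main

import (
	"encoding/xml"
	"io"
	"os"
)

// Placeholder struct; the real xlsxWorkbook has many more fields.
type xlsxWorkbook struct {
	XMLName xml.Name `xml:"http://schemas.openxmlformats.org/spreadsheetml/2006/main workbook"`
	// Populated only at write time, omitted otherwise.
	XmlnsR string   `xml:"xmlns:r,attr,omitempty"`
	Sheets []string `xml:"sheets>sheet"`
}

func writeWorkbook(w io.Writer, wb *xlsxWorkbook) error {
	// Set the relationships namespace right before encoding instead of
	// post-processing the marshaled bytes with bytes.Replace.
	wb.XmlnsR = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
	return xml.NewEncoder(w).Encode(wb)
}

func main() {
	if err := writeWorkbook(os.Stdout, &xlsxWorkbook{Sheets: []string{"Sheet1"}}); err != nil {
		panic(err)
	}
}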

@mlh758
Contributor

mlh758 commented Nov 1, 2019

Also, I've been tinkering with using just a zip archive to store the serialized files for the lifetime of the Excel object, but I'm running into a lot of issues with modifying the archive. You can write the same file multiple times into a zip archive, and removing a file means you have to copy the whole archive except for that one file.

Checking for the existence of a file being added first (and avoiding the copy most of the time a file is written to the archive) leads to some solid performance gains, but it also leads to some subtle bugs where the write buffer is not a valid zip archive and can't be read.

If I could figure out the write buffer issue it would be nice to combine it with the work you're doing in this PR to pass everything around as streams all the time.

@xuri xuri added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 18, 2019
@ducquangkstn
Contributor Author

@mlh758: Recently I found a way to pass any default XML to a struct like xmlWorkbook:
5dc36b7#diff-50b2cf86b4bb003458a82d3e89d0036a
PTAL

@ducquangkstn ducquangkstn changed the title [WIP] fix-save-to-file-as-stream fix-save-to-file-as-stream Jan 6, 2020
@jimsmart

jimsmart commented Apr 1, 2020

What is the status of this please?

We have quite an interest in specifically reducing the maximum total memory consumed when saving a file.

We would prefer overall memory saving over speed gains: our app runs on a server, where, for us, the tasks are not time critical but overall memory usage is.

@xuri
Member

xuri commented Apr 1, 2020

Hi @jimsmart, I have added a stream writer for generating a new worksheet with huge amounts of data. This PR contains a lot of code and I need some time to review.

@jimsmart

jimsmart commented Apr 1, 2020

Ok. Thanks for the info.

@ducquangkstn
Contributor Author

@xuri: There are other ways to improve performance, such as changing string fields like xlsxC.T and xlsxC.V to Stringer fields
(https://godoc.org/golang.org/x/tools/cmd/stringer).
But the change might be big, so I'm not sure whether I should create a PR.

Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.