
fix-save-to-file-as-stream #489

Closed
wants to merge 7 commits

Conversation

ducquangkstn
Contributor

PR Details

  • Change the write method from writing to a bytes.Buffer to streaming directly to an io.Writer

Description

  • Create a zip.Writer from the io.Writer
  • For each component, create a new file in the archive, then use an xml.Encoder to write it to the io.Writer (see the hedged sketch after the snippet below)
  • Use a custom MarshalXML to avoid this block of code:

func replaceRelationshipsBytes(content []byte) []byte {
	oldXmlns := []byte(`xmlns:relationships="http://schemas.openxmlformats.org/officeDocument/2006/relationships" relationships`)
	newXmlns := []byte("r")
	return bytes.Replace(content, oldXmlns, newXmlns, -1)
}
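
For reference, a minimal sketch of the streaming flow described above, with placeholder names (writeComponent, the workbook struct) rather than the actual code in this PR:

package main

import (
	"archive/zip"
	"encoding/xml"
	"io"
	"os"
)

// writeComponent streams one XML part straight into its zip entry, so no
// intermediate bytes.Buffer ever holds the whole serialized document.
func writeComponent(zw *zip.Writer, name string, v interface{}) error {
	w, err := zw.Create(name) // new file entry inside the archive
	if err != nil {
		return err
	}
	if _, err := io.WriteString(w, xml.Header); err != nil {
		return err
	}
	return xml.NewEncoder(w).Encode(v)
}

func main() {
	out, err := os.Create("test.xlsx")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	zw := zip.NewWriter(out) // zip.Writer built on top of any io.Writer
	defer zw.Close()

	// Placeholder component; the real code would encode xlsxWorkbook,
	// xlsxWorksheet, xlsxStyleSheet, and so on.
	type workbook struct {
		XMLName xml.Name `xml:"workbook"`
	}
	if err := writeComponent(zw, "xl/workbook.xml", &workbook{}); err != nil {
		panic(err)
	}
}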

Related Issue

#487

Motivation and Context

Improve performance and reduce the memory used for storing XML data.

How Has This Been Tested

  • Wrote a unit test comparing the output of the custom marshal function with the bytes.Replace result

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@xuri xuri self-requested a review September 27, 2019 10:48
xmlWorksheet.go Outdated
@@ -43,11 +45,114 @@ type xlsxWorksheet struct {
ExtLst *xlsxExtLst `xml:"extLst"`
}

// MarshalXML implements xml.Marshaler
Member

Invalid attributes and incorrect namespace order in the workbook, worksheets, and styles, which will cause the file to be corrupted when creating a new file

Contributor Author

  • These attributes and namespaces are the ones replaced by replaceRelationshipsBytes(replaceWorkSheetsRelationshipsNameSpaceBytes(output))
    in https://github.com/360EntSecGroup-Skylar/excelize/blob/master/sheet.go#L110
  • I tested this by creating a new file and extracting it as a zip archive. Both branches, mine and master, give the same results, which means the output files are identical

Contributor Author

Also, this is a better way to replace the replaceRelationshipsBytes function.

Contributor Author

@xuri please re-review the PR or give me more comments

Member

@xuri xuri Oct 18, 2019

Thanks for your PR. There is a lot of code in this PR; I maintain this project in my spare time, so I will need some time to review it.

file.go (resolved)
xmlStyles.go (resolved)
xmlStyles_test.go (resolved)
lib.go Outdated
return encoder.Encode(data)
}

// writeStringToZipWriter writes string to zip,Writer
Contributor

Extra comma in the comment here


checking grammar in comment :p

Contributor

I know it's a little trivial, but clean comments help everyone who uses the code.

Contributor Author

comment updated

file.go (resolved)
xmlWorkbook.go Outdated
@@ -44,6 +44,70 @@ type xlsxWorkbook struct {
FileRecoveryPr *xlsxFileRecoveryPr `xml:"fileRecoveryPr"`
}

// MarshalXML implements xml.Marshaler
func (x xlsxWorkbook) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
x2 := struct {
Contributor

Would it be too costly to add these fields used in the custom marshal functions to the normal structs we already have and then populate the extra fields during a write? It seems like this exposes us to potential bugs where we might add other fields to things like xlsxWorkbook and forget to add them to these structs here and be very confused as to why they aren't appearing in the final output.
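
For illustration, a hedged sketch of the pattern being discussed (names and fields are placeholders, not the actual PR code): the custom MarshalXML copies the struct's data into an anonymous struct that carries the extra namespace attribute, so any field later added to the base struct but not mirrored there would silently drop out of the output.

package main

import (
	"encoding/xml"
	"os"
)

// Placeholder definition, not the real excelize struct.
type xlsxWorkbook struct {
	Sheets []string `xml:"sheets>sheet"`
}

// MarshalXML mirrors every field into an anonymous struct so the namespace
// attributes can be emitted; forgetting to mirror a field loses it on write.
func (x xlsxWorkbook) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	x2 := struct {
		XMLName xml.Name `xml:"http://schemas.openxmlformats.org/spreadsheetml/2006/main workbook"`
		XmlnsR  string   `xml:"xmlns:r,attr"`
		Sheets  []string `xml:"sheets>sheet"`
		// ...every other xlsxWorkbook field would have to be repeated here...
	}{
		XmlnsR: "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
		Sheets: x.Sheets,
	}
	return e.Encode(x2)
}

func main() {
	_ = xml.NewEncoder(os.Stdout).Encode(xlsxWorkbook{Sheets: []string{"Sheet1"}})
}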

xmlWorkbook.go (resolved)
xmlWorkbook.go (resolved)
@mlh758
Contributor

mlh758 commented Oct 24, 2019

Since this is primarily a performance PR, it would be helpful if you included some benchmarks in the test files to show the actual difference and prevent future regressions as changes are made.

@codecov-io

codecov-io commented Oct 24, 2019

Codecov Report

Merging #489 into master will decrease coverage by 0.71%.
The diff coverage is 61.48%.


@@            Coverage Diff             @@
##           master     #489      +/-   ##
==========================================
- Coverage    97.1%   96.39%   -0.72%     
==========================================
  Files          28       29       +1     
  Lines        6078     6128      +50     
==========================================
+ Hits         5902     5907       +5     
- Misses         93      117      +24     
- Partials       83      104      +21
Impacted Files Coverage Δ
xmlWorksheet.go 100% <ø> (ø) ⬆️
excelize.go 95.5% <100%> (ø) ⬆️
calcchain.go 100% <100%> (ø) ⬆️
pivotTable.go 92.8% <100%> (ø) ⬆️
styles.go 98.58% <100%> (ø) ⬆️
stream.go 86.95% <100%> (+0.04%) ⬆️
xmlUtils.go 100% <100%> (ø)
cell.go 95.94% <100%> (ø) ⬆️
table.go 93.49% <100%> (ø) ⬆️
file.go 62.33% <22.22%> (-25.73%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ca7231...7848184.

@ducquangkstn ducquangkstn changed the title fix-save-to-file-as-stream [WIP] fix-save-to-file-as-stream Oct 24, 2019
@mlh758
Contributor

mlh758 commented Oct 29, 2019

I created a benchmark here and ran it against master and this branch.

Master
BenchmarkWrite-8 2 574756323 ns/op 207278252 B/op 1893776 allocs/op

PR
BenchmarkWrite-8 2 548271059 ns/op 164345284 B/op 1893817 allocs/op

I also ran it against my branch here that I brought up in #494 which has a narrower scope of changes.

BenchmarkWrite-8 2 517713714 ns/op 127115252 B/op 1833746 allocs/op

We can probably combine both changes but there is clearly some benefit to targeting the most allocation heavy functions when going after memory issues.
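
The benchmark itself is only linked above; a write benchmark along these lines might look like the following sketch (the module path, sheet size, and cell values are assumptions, not the exact benchmark behind the numbers):

package excelize_test

import (
	"fmt"
	"io/ioutil"
	"testing"

	"github.com/360EntSecGroup-Skylar/excelize/v2"
)

// BenchmarkWrite fills a sheet with many numeric cells and serializes it,
// so allocations on the save path show up in -benchmem output.
func BenchmarkWrite(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f := excelize.NewFile()
		for row := 1; row <= 10000; row++ {
			for col := 'A'; col <= 'J'; col++ {
				cell := fmt.Sprintf("%c%d", col, row)
				if err := f.SetCellValue("Sheet1", cell, row); err != nil {
					b.Fatal(err)
				}
			}
		}
		// Write to a throwaway sink so disk I/O stays out of the numbers.
		if err := f.Write(ioutil.Discard); err != nil {
			b.Fatal(err)
		}
	}
}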

@mlh758
Contributor

mlh758 commented Oct 29, 2019

Edit: Disregard all of this comment, I misunderstood what was possible with the zip library!

Would we achieve even more memory gains by replacing the File.XLSX map of map[string][]byte with a struct that wraps a zip file? Since it implements logic for reading and writing we could pass files in and out of it and let the zip object inside re-compress them when we're done with the files.

The struct could provide a similar interface to a map making the updates easier throughout the library. That should avoid the current issue of essentially storing an additional copy of the file in memory. saveFileList seems like the ideal place to provide write access into the zip file.

We could even provide some configuration options to flush the file contents to a temp file on disk while we are working with the file so we don't have to hold that in memory the whole time either.

@ducquangkstn
Contributor Author

@mlh758 Currently, excelize stores data in both the map[string][]byte and the zip.Writer; zip.Writer is part of the standard library and (I think) already well optimized.
I have tested the memory usage using this code:

package excelize_test

import (
	"fmt"
	"runtime"
	"testing"
	"time"

	// module path assumed to be the v2 path referenced elsewhere in this thread
	"github.com/360EntSecGroup-Skylar/excelize/v2"
	"github.com/stretchr/testify/require"
)

func TestFile_Save(t *testing.T) {
	//t.Skip()
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-done:
				return
			// time.After instead of time.Tick: calling time.Tick inside the
			// loop would leak a new Ticker on every iteration
			case <-time.After(time.Millisecond * 500):
				PrintMemUsage()
			}
		}
	}()

	f := excelize.NewFile()
	for s := 0; s < 5; s++ {
		sheetName := fmt.Sprint("sheet", s+1)
		f.NewSheet(sheetName)
		for _, col := range []string{"A", "B", "C", "D"} {
			for i := 0; i < 100000; i++ {
				require.NoError(t, f.SetCellValue(sheetName, fmt.Sprint(col, i+1), i))
			}
		}
	}
	require.NoError(t, f.SaveAs("test.xlsx"))
	PrintMemUsage()

	done <- struct{}{}
}

// PrintMemUsage outputs the current, total and OS memory being used, as well as
// the number of garbage collection cycles completed.
func PrintMemUsage() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// For info on each field, see: https://golang.org/pkg/runtime/#MemStats
	fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
	fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
	fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
	fmt.Printf("\tNumGC = %v\n", m.NumGC)
}

func bToMb(b uint64) uint64 {
	return b / 1024 / 1024
}

The result is that this PR saves about 15% of memory.

@mlh758
Contributor

mlh758 commented Oct 31, 2019

You can use Go's built-in tools for profiling and memory statistics:

go test -benchmem -run='github.com/360EntSecGroup-Skylar/excelize/v2' -bench BenchmarkWrite -memprofile memprofile.out

go tool pprof -nodefraction=0.1 -png memprofile.out > prchange.png

That runs the benchmark I posted in my comment above and profiles the memory usage. The pprof command trims out smaller functions to clean up the visual and creates an image showing consumption. Here is what your PR looks like, if you're curious:

[memory profile graph generated from memprofile.out: prchange.png]

What I was getting at in my comment is that most of the memory consumed is in smaller operations repeated many times that perform excessive allocations. The copying at write time is definitely a good target for optimization, and I like the idea of getting rid of the replaceRelationshipsNameSpaceBytes-type functions, because that will let us stream files more effectively. But when you're going after memory issues, it's a good idea to profile first and then go after the low-hanging fruit. As it stands right now, this PR reduces total memory consumed but increases allocations and CPU time. I suspect some of that is due to the anonymous structs and having to copy data across to them.

Could you add the XML fields to the base structs we already have, and change the fields before we write instead? That would keep us from having to copy, and would also likely keep us from needing custom Marshal functions, which I suspect are the big culprit of the slowdown. Custom marshal functions can have some surprising side effects.
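
A minimal sketch of that suggestion, assuming illustrative names rather than the real excelize definitions: keep the namespace attribute as an ordinary field on the base struct and populate it just before encoding, so nothing is copied into an anonymous struct and no custom MarshalXML is needed.

package main

import (
	"encoding/xml"
	"io"
	"os"
)

// Placeholder struct; the real xlsxWorkbook has many more fields.
type xlsxWorkbook struct {
	XMLName xml.Name `xml:"http://schemas.openxmlformats.org/spreadsheetml/2006/main workbook"`
	// Populated only at write time, omitted otherwise.
	XmlnsR string   `xml:"xmlns:r,attr,omitempty"`
	Sheets []string `xml:"sheets>sheet"`
}

func writeWorkbook(w io.Writer, wb *xlsxWorkbook) error {
	// Set the relationships namespace right before encoding instead of
	// post-processing the marshaled bytes with bytes.Replace.
	wb.XmlnsR = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
	return xml.NewEncoder(w).Encode(wb)
}

func main() {
	if err := writeWorkbook(os.Stdout, &xlsxWorkbook{Sheets: []string{"Sheet1"}}); err != nil {
		panic(err)
	}
}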

@mlh758
Contributor

mlh758 commented Nov 1, 2019

Also, I've been tinkering with using just a zip archive to store the serialized files for the lifetime of the Excel object, but I'm running into a lot of issues with modifying the archive. You can write the same file multiple times into a zip archive, and removing a file means you have to copy the whole archive except for that one file.

Checking for the existence of a file being added first (and avoiding the copy most of the time a file is written to the archive) leads to some solid performance gains, but it also leads to some subtle bugs where the write buffer is not a valid zip archive and can't be read.

If I could figure out the write buffer issue it would be nice to combine it with the work you're doing in this PR to pass everything around as streams all the time.

@xuri xuri added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 18, 2019
@ducquangkstn
Contributor Author

@mlh758: Recently I found a way to pass any default XML to a struct like xmlWorkbook:
5dc36b7#diff-50b2cf86b4bb003458a82d3e89d0036a
PTAL

@ducquangkstn ducquangkstn changed the title [WIP] fix-save-to-file-as-stream fix-save-to-file-as-stream Jan 6, 2020
@jimsmart

jimsmart commented Apr 1, 2020

What is the status of this please?

We have quite an interest in specifically reducing the maximum total memory consumed when saving a file.

We would prefer overall memory saving over speed gains: our app runs on a server, where, for us, the tasks are not time critical but overall memory usage is.

@xuri
Member

xuri commented Apr 1, 2020

Hi @jimsmart, I have added a stream writer for generating a new worksheet with huge amounts of data. This PR contains a lot of code and I need some time to review.

@jimsmart

jimsmart commented Apr 1, 2020

Ok. Thanks for the info.

@ducquangkstn
Contributor Author

@xuri: There are other ways to improve performance, such as changing string fields like xlsxC.T and xlsxC.V to Stringer fields
(https://godoc.org/golang.org/x/tools/cmd/stringer).
But the change might be big, so I'm not sure whether I should create a PR.

Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.