Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

Reader doesn't understand which is the last column (ODS file) #561

Open
eugenio11 opened this issue Jun 8, 2018 · 10 comments
Open

Reader doesn't understand which is the last column (ODS file) #561

eugenio11 opened this issue Jun 8, 2018 · 10 comments

Comments

@eugenio11
Copy link

Hello,
for some files (until now, it happened only with ODS files), it seems the reader wrongly see all the 1024 columns as part of the sheet, even if just three columns are used. Basically the reader, from the fourth column on, returns an empty string for every cell.

I am not sure if it's a spout issue and most of the time it doesn't happen, I cannot reproduce the issue starting from a new ODS file but I have an ODS file that triggers the issue, if you need it.

Thanks.

@madflow
Copy link
Contributor

madflow commented Jun 8, 2018

@eugenio11 Yes - please provide the file in an anonymized form if possible

@eugenio11
Copy link
Author

tet.ods.zip

Hello, here is the file

@adrilo
Copy link
Collaborator

adrilo commented Jun 10, 2018

Hi @eugenio11 !

Here is what the content file of the ODS looks like:

<table:table-row table:style-name="ro1">
	<table:table-cell office:value-type="string">
		<text:p>nome</text:p>
	</table:table-cell>
	<table:table-cell office:value-type="string">
		<text:p>cognome</text:p>
	</table:table-cell>
	<table:table-cell office:value-type="string" table:style-name="ce3">
		<text:p>indirizzo</text:p>
	</table:table-cell>
	<table:table-cell table:number-columns-repeated="1021"/>
</table:table-row>

The last table-cell line is what causes the 1000+ columns to be added. The fix should be to ignore the line if it's the last one in the row, and if the value is empty.

This is the part that should be modified:

if ((count($this->currentlyProcessedRowData) + $actualNumColumnsRepeated) !== self::MAX_COLUMNS_EXCEL) {
. The current logic only took into account the specific Excel behavior but we should probably make it more generic to avoid adding empty trailing columns

@eugenio11
Copy link
Author

Hello and thanks for the reply.
I am wondering how a file like the one I attached can be produced, I don't remember the steps I did to make it and I am not able to reproduce it.
About your fix, what if all the 1000+ columns are actually used?

@adrilo
Copy link
Collaborator

adrilo commented Jun 11, 2018

Some softwares do add this repeated columns cell for some reasons. Some versions of Excel for instance add this. I'm not sure why...
I made the change but ended up breaking some tests, which reminded me that your point is good. Sometimes, it is the desired behavior to have empty cells at the end of the row.

I'm gonna look at the official specs to see what we should do in this case. Eventually, we could have something a bit magic: if the last cell is empty and repeated more than X times, then do not add it; X being a magic number to be determined. This approach should solve 99% of the use cases and be the cause of bugs in the 1% of cases where the last cell is actually repeated a lot on purpose.

@madflow
Copy link
Contributor

madflow commented Jun 11, 2018

I am wondering how a file like the one I attached can be produced, I don't remember the steps I did to make it and I am not able to reproduce it.

You can reproduce such a file by applying a format to the entire row. Even though only the first cell contains data - LibreOffice at least - will consider the rest of the cells as not empty.

@eugenio11
Copy link
Author

Hello,
thanks!
Yes, the solution you mentioned is the same I wanted to use on my side; I think the best way to implement it could be leaving the correct one as the default behavior and add a parameter that allows to set the "magic number".

@madflow
Copy link
Contributor

madflow commented Jun 11, 2018

Eventually, we could have something a bit magic: if the last cell is empty and repeated more than X times, then do not add it; X being a magic number to be determined.

I would favor introducing the concept of a spreadsheet header or a reader range to Spout.

  • When the first line should be treated as the header - then the complete results will contain the same count of cells - empty or not - defined by the header.

  • Alternatively one could set a reader range (Options::setReaderRange(int $startColumn, int $endColumn)). This would be basically the same like the header concept - but more explicit.

This would probably not work in some use cases, where there is no header and the range is not known. For us - this would solve issues like this. We always use a headerline to define the spreadsheet.

I feel weird about introducing magic number setters to circumvent vendor quirks.

@adrilo
Copy link
Collaborator

adrilo commented Jun 11, 2018

So, I looked at the specs but there's nothing regarding what the behavior should be. This means that a table-cell with 1023 repeated columns actually represents 1024 cells!
The way some libraries handle this is by reading the entire spreadsheet and keeping a count of the max num columns for each row. Unfortunately, we can't do that...

So @eugenio11, I'm afraid we can't fix this behavior for now. I guess if you know that your spreadsheet is very small, you can store all the data in memory and do the counting of max num columns yourself to find the actual dimensions of the spreadsheet.

@madflow I like your suggestion about using the range (and probably defaulting the start column to 0). If you can change that even after the reading started, one could read the header, count its number of columns and set the range. That way, there's no need to have specific code to handle header row.

@eugenio11
Copy link
Author

@adrilo thanks for the info; I guess we can implement the "magic number" feature on our side without reading the entire spreadsheet, just the first (or the second, if the first defines labels) row.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants