Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XLSX files: Add Excel::ValueReader::XLSX backend #50

Closed
shawnlaffan opened this issue Oct 23, 2023 · 7 comments
Closed

XLSX files: Add Excel::ValueReader::XLSX backend #50

shawnlaffan opened this issue Oct 23, 2023 · 7 comments

Comments

@shawnlaffan
Copy link

It would be useful to have Excel::ValueReader::XLSX as a non-default parser option.

It only extracts values so is not as feature complete as the other parsers, but in many cases it is only the data that are wanted.

It is substantially faster than the current XLSX backends and might also be useful for #49.

Proof of concept code is below. This builds a structure that reasonably closely matches what SpreadSheet::Read generates using Spreadsheet::ParseXLSX. It is missing some metadata and the cell references like A3, BB72 but these could be added. It does match the cell data structure.

use 5.036;
use strict;
use warnings;

use Data::Printer;

use Spreadsheet::Read;
use Excel::ValueReader::XLSX;

my $file = 'r1.xlsx';  #  update as needed

my $book_SR  = ReadData ($file);
my $book_EVX = EVX ($file);
p $book_SR;
p $book_EVX;

sub EVX {
    my $file = shift;
    my $reader = Excel::ValueReader::XLSX->new($file);
    my @sheet_names = $reader->sheet_names;
    my %sheet_ids;
    @sheet_ids{@sheet_names} = (1 .. @sheet_names);
    my $workbook = [
        {
            error  => undef,
            parser => "Excel::ValueReader::XLSX",
            sheet  => \%sheet_ids,
            sheets => scalar @sheet_names,
            type   => 'xlsx',
        },
    ];

    foreach my $sheet_name ($reader->sheet_names) {

        my $grid = $reader->values($sheet_name);

        #  Transpose to column vectors.
        #  The A1, B5 etc items could be added here as well.
        my @t = ([]); #  first entry is empty
        foreach my $r (0 .. $#$grid) {
            my $row = $grid->[$r];
            foreach my $c (0 .. $#$row) {
                #  add 1 for array base 1
                $t[$c + 1][$r + 1] = $grid->[$r][$c];
            }
        }

        #  First entry in @t is padding so number of items
        #  is the max index.
        my $maxcol = $#t;

        #  No padding of first entry in $grid so 
        #  number of items is the array length.
        my $maxrow = @$grid;

        my $sheet = {
            label  => $sheet_name,
            cell   => \@t,
            maxrow => $maxrow,
            maxcol => $maxcol,
            minrow => 1,
            mincol => 1,
            indx   => 1,
            merged => [],
        };
        push @$workbook, $sheet;
    }

    return $workbook;
}
@Tux
Copy link
Owner

Tux commented Nov 1, 2023

Could you pull and see if 73312b1 tickles your needs?

t/630_xlsx.t ...... # Parser: Excel::ValueReader::XLSX-1.10
t/630_xlsx.t ...... ok
t/631_clr.t ....... ok
t/632_fmt.t ....... ok
t/633_misc.t ...... ok
t/634_dates.t ..... ok
t/635_perc.t ...... ok
t/637_merged.t .... ok

@shawnlaffan
Copy link
Author

Thanks for doing this. Unfortunately it seems not to be working.

Tested using Strawberry Perl 5.38.0.1.

Save the script below as check.pl at the top level of the repo, run perl Makefile.PL && gmake and then perl -Mblib check.pl

use 5.010;
use strict;
use warnings;

use Excel::ValueReader::XLSX;
use Spreadsheet::Read;

my $file = 'files/Active2.xlsx';

my $book = ReadData($file, debug => 1000, parser => 'Excel::ValueReader::XLSX');

Call results:

perl -Mblib check.pl
$Options = {
             'rc' => 1,
             'parser' => 'Excel::ValueReader::XLSX',
             'merge' => 0,
             'attr' => 0,
             'clip' => 1,
             'cells' => 1,
             'debug' => 1000,
             'strip' => 0,
             'dtfmt' => 'yyyy-mm-dd'
           };
<<- check.pl:10|Spreadsheet::Read::ReadData
I can open file files/Active2.xlsx, but I do not know how to parse it
 at check.pl line 10.

@Tux
Copy link
Owner

Tux commented Nov 2, 2023

You misinterpreted the parser option/attribute: that does not define the backend, but the document type :)

use Spreadsheet::Read;
BEGIN { $ENV{SPREADSHEET_READ_XLSX} = "Excel::ValueReader::XLSX"; } # <- force backend

my $file = "files/Active2.xlsx";
my $book = ReadData ($file, debug => 1000, parser => "xlsx"); # parser option optional here

The parser option is available to force a type on a document or stream where Spreadsheet::Read would otherwise not be able to determine what to do. Missing file extension, empty file, something else.
I'll see if I can make the docs more clear about that and also check if a new option backend is possible.

@shawnlaffan
Copy link
Author

I did indeed misinterpret the parser option. Clarifying the help system would be very helpful.

It would also be handy to be able to specify the backend for flexibility. Currently it needs to be set before the use Spreadsheet::Read call, which bakes it in. Setting local $ENV{SPREADSHEET_READ_XLSX} = 'Excel::ValueReader::XLSX'; has no effect so it cannot currently be overridden.

@Tux
Copy link
Owner

Tux commented Nov 2, 2023

Waiting for the tests to finish: Spreadsheet::Read 1183fcf support this:

    parser
      Force the data to be parsed by a specific format. Possible values are
      "csv", "prl" (or "perl"), "sc" (or "squirrelcalc"), "sxc" (or "oo",
      "ods", "openoffice", "libreoffice") "xls" (or "excel"), and "xlsx" (or
      "excel2007").

      When parsing streams, instead of files, it is highly recommended to
      pass this option.

      Spreadsheet::Read supports several underlying parsers per spreadsheet
      type. It will try those from most favored to least favored. When you
      have a good reason to prefer a different parser, you can set that in
      environment variables. The other options then will not be tested for:

       env SPREADSHEET_READ_CSV=Text::CSV_PP ...

      You can also directly pass the required backend, forcing the matching
      type, but this excludes version checking.

       # Checks for minimal version
       BEGIN { $ENV{SPREADSHEET_READ_CSV} = "Text::CSV_PP" }
       my $book = ReadData ("test.csv", parser => "csv");

       vs

       # NO check for minimal version
       my $book = ReadData ("test.csv", parser => "Text::CSV_PP");

@shawnlaffan
Copy link
Author

Thanks. This looks to do the trick.

@Tux
Copy link
Owner

Tux commented Nov 7, 2023

0.88 uploaded to CPAN

@Tux Tux closed this as completed Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants