Gets the same items many times #10

Closed
VladimirAlexiev opened this issue Jul 24, 2015 · 2 comments

Comments

@VladimirAlexiev

http://panic.image.ntua.gr:9876/foodanddrink/oai?verb=ListRecords&set=1003&metadataPrefix=rdf is a small set with 498 items.
http://validator.oaipmh.com/#ListRecords gets that many items.

But when I harvest them with Catmandu, it iterates forever: it gets the 498 items and then starts repeating them (not as a simple loop over all the items; the pattern is more complicated).

The script at the bottom gets all items (you have to abort it at some point):

alinari.pl > alinari.rdf

Then I find the item IDs like this:

perl -ne 'print qq{$1\n} if m{<edm:ProvidedCHO rdf:about="(.*?)"}' alinari.rdf > ids

and analyze them with something like

sort ids|uniq -c|sort -nr|less

In my case, different items were repeated 13, 12, and 1 times.
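
The counting step above can be wrapped into two small helpers. This is only a sketch: the function names `unique_count` and `dup_ids` are made up for illustration, the sample data is invented, and it assumes one identifier per line with no whitespace inside an identifier.

```shell
#!/bin/sh
# Sketch: summarize how often each identifier occurs in a file of one ID per line.
# Function names are hypothetical; IDs are assumed to contain no whitespace.

# Print the number of distinct identifiers in file $1.
unique_count() {
  sort -u "$1" | wc -l | tr -d ' '
}

# Print "count id" for every identifier seen more than once, most frequent first.
dup_ids() {
  sort "$1" | uniq -c | sort -nr | awk '$1 > 1 { print $1, $2 }'
}

# Tiny made-up sample standing in for the real `ids` file.
sample=$(mktemp)
printf '%s\n' id-a id-b id-a id-c id-a id-b > "$sample"

echo "distinct: $(unique_count "$sample")"
dup_ids "$sample"
rm -f "$sample"
```

On the real `ids` file, `unique_count ids` can be compared against the 498 items the set is expected to contain, and `dup_ids ids` lists only the repeated items.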


#!perl -w

use Catmandu::Importer::OAI; # https://metacpan.org/source/HOCHSTEN/Catmandu-OAI-0.08/README
use XML::Struct::Writer; # https://metacpan.org/source/XML::Struct::Writer

my $importer = Catmandu::Importer::OAI->new
  (url => "http://panic.image.ntua.gr:9876/foodanddrink/oai",
   metadataPrefix => "rdf",
   set => 1003,
   handler => "struct", # https://metacpan.org/source/HOCHSTEN/Catmandu-OAI-0.08/lib/Catmandu/Importer/OAI/Parser/raw.pm
   # xslt => "oai-unpack.xsl",
  );

print <<EOF;
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:edm="http://www.europeana.eu/schemas/edm/"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:ore="http://www.openarchives.org/ore/terms/">
EOF

my $writer = XML::Struct::Writer->new (to => \*STDOUT); # , xmldecl => 0
$importer->each(sub {
  my $item = shift->{_metadata}[2]; # skip rdf:RDF and its attributes (namespaces)
  #use Data::Dump; dd $item;
  $writer->writeElement($_) for @$item; # multiple XML elements
  print "\n";
});

print qq{</rdf:RDF>};

@VladimirAlexiev
Author

Maybe that server (MINT OAI) doesn't return a unique record/header/identifier for each record. But the pagination (resumption token) protocol itself seems to work fine, since http://validator.oaipmh.com/#ListRecords gets 498 records.
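
One way to check the non-unique identifier guess would be to pull the header identifiers out of a saved ListRecords response and look for repeats. The sketch below is an assumption-laden illustration, not something taken from the server or from Catmandu: `oai_header_ids` is a made-up helper, the XML fragment is invented, and the pattern only matches unprefixed `<identifier>` elements (a response that uses a namespace prefix, or metadata that also contains an unprefixed `<identifier>`, would need a different pattern).

```shell
#!/bin/sh
# Sketch: extract header identifiers from an OAI-PMH ListRecords response on stdin.
# Assumes unprefixed <identifier> elements; the helper name is hypothetical.

oai_header_ids() {
  grep -o '<identifier>[^<]*</identifier>' |
    sed -e 's/<identifier>//' -e 's/<\/identifier>//'
}

# Made-up response fragment in which one identifier appears twice.
# `uniq -d` prints only the identifiers that are repeated.
cat <<'XML' | oai_header_ids | sort | uniq -d
<record><header><identifier>oai:example:1</identifier></header></record>
<record><header><identifier>oai:example:1</identifier></header></record>
<record><header><identifier>oai:example:2</identifier></header></record>
XML
```

Against the real endpoint, each page of the response could be saved (for example with an HTTP client such as curl, which is an assumption here) and piped through the same helper.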

@phochste
Member

Hi, yes. I can reproduce this on my development box and will try to find a solution.

phochste added a commit that referenced this issue Aug 17, 2015