Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add logic for DIRECTORY_SEPARATOR to account for "/" or "\" in reposi… #432

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

flummingbird
Copy link
Contributor

…tory unique identifier

What does this Pull Request do?

We encountered a problem harvesting OAI, where someone's repository unique identifier had problematic characters IE: '/'

What's new?

We added logic to the src/writers/Oaipmh.php normalize_filename function, that would account for system directory separators. we chose to use hyphens to replace.

How should this be tested?

Set your repository unique identifier in islandora_oai config to http://example.com

Then try to harvest oai-pmh from there.

any sample oai-pmh sample config should work.

  • Does this change require documentation to be updated?
    set_spec was a little hard to figure out. also the OAI example for harvesting from other islandoras, wasn't in the wiki TOC.

Interested parties @mjordan

Tag interested parties using @ mentions

@@ -133,6 +133,8 @@ public function writeMetadataFile($metadata, $path, $overwrite = true)
public function normalizeFilename($string)
{
$string = urldecode($string);
$pattern = '#'.DIRECTORY_SEPARATOR.'#';
Copy link
Collaborator

@mjordan mjordan Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run MIK on Windows, and since DIRECTORY_SEPARATOR on Windows is \, we've been burned using DIRECTORY_SEPARATOR unless the pattern is run through PHP's preg_quote(), since \ is also an escape character in regexes. So I recommend, and have used, something like this:

$pattern = preg_quote('#' . DIRECTORY_SEPARATOR . '#');

I can test with that code if you want, report back, and if it works on Windows and Linux, I'll merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great.

Copy link
Collaborator

@mjordan mjordan Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more end-of-the-day hacking around has produced:

    /**
     * Convert %3A (:) and other characters in filenames into underscores (_).
     */
    public function normalizeFilename($string)
    {
        $string = urldecode($string);
        $bad_chars = [
            ':' => '_',
            DIRECTORY_SEPARATOR => '_',
        ];
        $string = str_replace(array_keys($bad_chars), $bad_chars, $string);
        return $string;
    }

This replaces : in my local tests. Could you test to see if it catches the directory separator? The benefit of str_replace() is that you don't need to worry about nasty cross-platform issues (I hope). Also adding new bad characters is pretty easy.

@mjordan
Copy link
Collaborator

mjordan commented Jul 19, 2017

It's awesome you're using OAI toolchains! I'd love to know more.

@mjordan
Copy link
Collaborator

mjordan commented Jul 19, 2017

@flummingbird WRT you comments about testing, I'll update the docs in response. Thanks for the feedback.

@mjordan
Copy link
Collaborator

mjordan commented Jul 19, 2017

Also, we'd prefer if you open an issue describing the problem you've found before opening a PR. No sweat for this PR though, this is a good catch, thanks for finding it.

@flummingbird
Copy link
Contributor Author

not sure what the WRT stands for... Most of the credit should go to @jpeak5

@mjordan
Copy link
Collaborator

mjordan commented Jul 19, 2017

Sorry, 'with regard to.'

@mjordan
Copy link
Collaborator

mjordan commented Jul 19, 2017

@flummingbird @jpeak5 looking at this a bit more closely, I wonder if we should define a list of potentially problematic characters to replace with _. For example, : and DIRECTORY_SEPARATOR. There may be others as well. We could implement like

$pattern = preg_quote('#(' . DIRECTORY_SEPARATOR . '|:)#');
$string = preg_replace($pattern, '_', $string);

Would this address the situation you encountered?

@jpeak5
Copy link
Contributor

jpeak5 commented Jul 20, 2017 via email

@jpeak5
Copy link
Contributor

jpeak5 commented Jul 21, 2017

Mark,

Your end-of-day hack works just fine in our scenario. I'm curious though about the cross-platform issues you mention with preg_replace vs str_replace: I'd always thought that str_replace was just lighter, simpler, faster and that the preg_* family was for when you needed full regex.


public function normalizeFilename($string) {
    $bad_chars = [
      '#:#', 
      '#'.DIRECTORY_SEPARATOR.'#',
      ];
    return preg_replace($bad_chars, '_', urldecode($string));
  }

Any rate- I'd like to find out more about this OAI workflow, specifically, how, once you have the records, you ingest such that the Islandora items present thumbnails and links back to the original. I'm finally able next week to put some time into looking closely at this part of mik...

We plan to harvest collections from only a single institution to start with, but I'm pretty interested in knowing more about how you or others are leveraging OAI.

Alright- thx as always!

jrp

@mjordan
Copy link
Collaborator

mjordan commented Jul 21, 2017

@jpeak5 Thanks for testing. The cross-platform issue we encountered with preg_replace() was that by using DIRECTORY_SEPARATOR within patterns, on Windows this converted to \, which then acted as an escape character within the pattern, making many matches fail. According to this SO post, you do have to escape some needle characters within str_replace(). So my comment that using it instead of preg_replace() is not completely valid.

On to OAI... MIK's OAI toolchains spit out ordinary Islandora ingest packages, although usually with DC instead of MODS XML files. The OAI toolchains don't directly influence the behavior of the resulting Islandora objects in any special way. If I am understanding what you are describing, if you want the thumbnails of the objects (in collection browse lists or search result sets) to lead users offsite to the originating URLs, I think you'd need to do some stuff within Islandora itself to get that to work. That said, you could probably harvest thumbnails from the source objects to reuse in Islandora if you wanted, using a post-write hook script or a custom OAI filegetter. I'd have to know more about what you'd expect the Islandora objects corresponding to the harvested remote objects would be like to speculate on other solutions.

@mjordan
Copy link
Collaborator

mjordan commented Sep 20, 2017

@jpeak5 sorry to leave this PR open so long, but now that it sounds like you guys are fully back in operation, and that the OAI angle was mentioned in Mike Waugh's email announcing the launch of http://louisianadigitallibrary.org/, we thought we'd ping you. We'd be interested in knowing more about how you integrated the data you retrieved via OAI into the site.

@mjordan
Copy link
Collaborator

mjordan commented Sep 20, 2017

Also, mind if we link to a few of the collections on http://louisianadigitallibrary.org/ under the "Islandora content that has been prepared using MIK" section of the MIK README?

@jpeak5
Copy link
Contributor

jpeak5 commented Sep 20, 2017

Hi Mark,

We've definitely been busy - I'm sorry to say that I'd forgotten all about this!

Definitely no problem to link to the collections.

We are still working on problems with harvesting the last 4 of about 80 Tulane collections, so this story is by no means done (and we have much code cleanup and at least one more PR), but here's a brief sketch of what we've done:

We have used mik/oai to pull MODS + TN for all of the collections and items in Tulane University's standalone Islandora (at their request, of course).

We pulled MODS because it was expedient - we needed a field (containing the source URL) that seemed to be getting stripped out of the DC.

We injected a string into MODS.abstract containing a link back to the item in Tulane's site:
lsulibraries@6613f56#diff-3cb45e06ac756c68d4fc7249c5c6b6e4R41

Then, with some spaghetti, sorely in need of refactor, we alter the display of harvested items in our solr search results: https://github.com/lsulibraries/islandora_namespace_homepage/blob/master/islandora_namespace_homepage.module#L346

When we finish with this round of OAI, I'm hoping that we will fix up our code, PR, and then do some better documentation of what we've done (and how we intend to keep our Tulane items in sync with them...)...

jrp

@mjordan
Copy link
Collaborator

mjordan commented Sep 20, 2017

Then, with some spaghetti, sorely in need of refactor, we alter the display of harvested items in our solr search results: https://github.com/lsulibraries/islandora_namespace_homepage/blob/master/islandora_namespace_homepage.module#L346

Cool - so the thumbnail links back to the object at Tulane. What content model are those objects in your Islandora?

@mjordan
Copy link
Collaborator

mjordan commented Sep 20, 2017

When we finish with this round of OAI, I'm hoping that we will fix up our code, PR, and then do some better documentation of what we've done (and how we intend to keep our Tulane items in sync with them...)...

How you've used MIK to harvest the OAI, and integrated the resulting objects into Islandora, would make a super-mega-awesome Migration Guide for the MIK wiki... 😀, don't you think so @MarcusBarnes?

@MarcusBarnes
Copy link
Owner

Agreed: super-mega-awesome. :)

@flummingbird
Copy link
Contributor Author

Hey Mark!
a few things: All our collections were done with MIK!! Every dang one. Oai-pmh or not.
I elected to use the sp_basic_image cmodel for tulane's items
We'd love to work on the MIK-wiki

@mjordan
Copy link
Collaborator

mjordan commented Sep 20, 2017

@flummingbird neat. I just opened a PR to add links to sample collections.

Nice work guys!

@mjordan mjordan mentioned this pull request Oct 3, 2017
@mjordan
Copy link
Collaborator

mjordan commented Oct 11, 2017

@flummingbird @jpeak5 inspired by your use of OAI and redirecting to a remote object, I've developed a solution pack that does a lot of the same things but treats the local objects as a separate content model. It's in our local gitlab at https://git.lib.sfu.ca/mjordan/islandora_solution_pack_remote_resource. Sorry I didn't let you know back when I was working on it, I got distracted..... anyway, I'd welcome any comments you have before publicizing it more widely. There may be a use case for it here in British Columbia, maybe even within my own library since we still have a couple bespoke repository platforms that we'll probably never fully migrate into our Islandora.

@flummingbird
Copy link
Contributor Author

flummingbird commented Nov 7, 2017 via email

@mjordan
Copy link
Collaborator

mjordan commented Nov 7, 2017

@flummingbird yes, I've been working on this lately to get ready for our local code4lib meetup, where I'm going to do a lightning talk on it.

I've ripped out the --url_list stuff and removed that functionality because the batch code was becoming too complex. To replace the ability to load from a CSV file, I've built a separate script that uses a proper CSV parser. I've combined it and the MIK postwrite hook script into this repo: https://git.lib.sfu.ca/mjordan/islandora_remote_resource_batch_tools. This tool lets you use a Twig template to crate DC or MODS files as well.

In this new workflow, you'd run php convert_csv.php --csv sample_data/test.csv --output_dir /tmp/output using your CSV file, and the output would be the same as the directory scan method the batch module uses. I figured that standardizing the input data used by the batch module would let people build their own tools to produce the data. I hope that this change is consistent with your workflow. I'd be very interested to hear more.

@mjordan
Copy link
Collaborator

mjordan commented Nov 7, 2017

@flummingbird I just spent some time fixing up the CSV tool so that you can give it a spin:

https://git.lib.sfu.ca/mjordan/islandora_remote_resource_batch_tools/tree/master/csv

As I mentioned above, this tool is intended to be a more robust replacement for the --url_list option in the batch module.

@mjordan
Copy link
Collaborator

mjordan commented Nov 8, 2017

Latest...

I've also added a GUI for loading a zip containing the data files, and resolved a few issues. I'll be porting over the remaining issues from gitlab to the GitHub repo soon. I'll be applying all updates to the GitHub versions and deprecating the gitlab repos from now on.

@mjordan
Copy link
Collaborator

mjordan commented Nov 8, 2017

OK, all ported to GitHub other than my notes for the code4lib LT.

@flummingbird
Copy link
Contributor Author

flummingbird commented Nov 8, 2017 via email

@mjordan
Copy link
Collaborator

mjordan commented Nov 27, 2017

@flummingbird I'm going to demo this solution pack at our local code4lib on Friday. If you've had a chance to test it out I'd love to get some feedback. No worries if you haven't though.

@jpeak5
Copy link
Contributor

jpeak5 commented Nov 27, 2017

@mjordan I replied via email, but since it hasn't shown up here after an hour, I'm gonna paste my reply below...


Mark,

I'm sorry I haven't written sooner to say that this is fantastic! - exactly the content type we need.
We looked at this in the week before the Thanksgiving break, and we have added some classes and some drush functionality that will let us harvest directly from the command line.

mjordan/islandora_solution_pack_remote_resource@7.x...lsulibraries:FR-oai

In the short-term, I expect that that is how we will run the ingest, but I have begun planning for a GUI sync framework that will allow management and reporting over collections that use the Remote Resource in addition to automated (likely cron) updates.

Thank you so very much for this, again!
jrp

@mjordan
Copy link
Collaborator

mjordan commented Nov 27, 2017

@jpeak5 awesome. I agree that OAI is a very important source of data for this solution pack. I put together some helper scripts, https://github.com/mjordan/islandora_remote_resource_batch_tools, which include an MIK post-write hook script to convert that output of its OAI crawl to a format that the batch loader could use as input. But, including OAI functionality directly in the batch loader makes a lot of sense and removes the dependency on MIK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants