Thinking about more complex plugins #31

Closed
cameronneylon opened this issue Jun 29, 2013 · 1 comment

@cameronneylon

I am in the process of looking at plugins for Wiley and Springer, and will move on to Taylor and Francis shortly. The SpringerLink case has thrown up an interesting issue. If you look at the article:

http://link.springer.com/article/10.1007%2Fs11207-013-0275-y

The only licensing information is the little orange badge, which tells us nothing useful except that the article should be accessible. As far as I can tell, articles carrying that badge have differing licenses, or none at all.

But in some cases if you go through to the full text such as:

http://link.springer.com/article/10.1007%2Fs11207-013-0275-y/fulltext.html

...you will see a license statement there before the references.

So all this is fine - I can run self.simple_extract with the original URL and, if it returns a "publisher-asserted-accessible" result, test again with the inferred full-text URL. The question here is how best to handle that process.

My thought is to take the plugin skeleton as is, which calls the self.simple_matcher method. Test the record object after this to see if it contains a publisher-asserted-accessible license. If so, create a new record, infer the full-text URL, and send the new record to self.simple_matcher. If the new record comes back with a CC BY license, replace the old license with the new one. The other option is to create a copy of the old record, send the original to be rematched with the new URL, and then restore the copy if we don't get anything back on the second cycle.

Something like:

        if self.supports_url(url):
            self.simple_extract(lic_statements, record, url)

        if record['bibjson']['license']['type'] == "publisher-asserted-accessible":
            temp_record = {}  # placeholder: not sure how to create a new record
            url = url + '/fulltext.html'
            self.simple_extract(lic_statements, temp_record, url)
            if temp_record['bibjson']['license']['type'] == "cc-by":
                record['bibjson']['license'] = temp_record['bibjson']['license']

Just checking if there is a better way to do this. It feels a little hacky making copies, checking them against each other and then overwriting.
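For comparison, the second option (keep a copy, rematch the original against the full-text URL, and restore the copy if nothing comes back) could be sketched roughly as below. This is only a sketch: `rematch_with_fallback` and the `extract` callable are hypothetical names standing in for however the plugin ends up calling `self.simple_extract`, and the record layout is assumed from the snippet above.

```python
import copy


def rematch_with_fallback(record, url, extract):
    """Re-run extraction against the inferred full-text URL.

    `extract(record, url)` is a hypothetical stand-in for a call such as
    self.simple_extract(lic_statements, record, url); it is assumed to
    mutate `record` in place.
    """
    backup = copy.deepcopy(record)           # keep the first-pass result
    record['bibjson'].pop('license', None)   # clear it so a miss is detectable
    extract(record, url + '/fulltext.html')

    if not record['bibjson'].get('license'):
        # Second pass found nothing: restore the first-pass data.
        record['bibjson'] = backup['bibjson']
    return record
```

The deep copy matters here because the nested `bibjson` dict would otherwise be shared between the backup and the record being rematched.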

@ghost ghost assigned emanuil-tolev Jun 29, 2013
@emanuil-tolev

Creating a new "basic" record so simple_extract can groove with it:

record = {}
record['bibjson'] = {}
record['provider'] = {}
record['provider']['url'] = [url] # note this is a list, in your case just with 1 item
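
Wrapped up as a helper, that might look like the sketch below. `make_basic_record` is a hypothetical name, not something that exists in the codebase; the keys are just the ones shown above.

```python
def make_basic_record(url):
    """Build a minimal record with the keys shown in the snippet above."""
    return {
        'bibjson': {},
        # 'url' is a list; here it holds just the one URL being checked
        'provider': {'url': [url]},
    }


temp_record = make_basic_record(
    'http://link.springer.com/article/10.1007%2Fs11207-013-0275-y/fulltext.html'
)
```

Each call returns a fresh dict, so the temporary record never shares state with the original one.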

Long answer:

Plugins aren't necessarily restricted to using the generic methods (such as self.simple_extract) - they inherit them from the base Plugin class for convenience and wrapping up some commonly used functionality, but they can certainly do whatever they want with the content (e.g. see elife which queries an API and runs an XPath expression on the XML, yet produces the same license structure as all other plugins).

It seems to me that you need 2 requests here, necessitated by the way the publisher has structured the page. So you're not getting away from that overhead.

Then it's a question of what you'd like to do with the data both times that you hit the page. So:
1/ Try the usual simple_extract on the item that you initially get. See if it's identified as "publisher-asserted-accessible".
2/ If so, query a custom URL (add "/fulltext.html") and check whether the returned text contains any of the licenses that this plugin knows about.

So, here's a solution: Plugin.simple_extract should be broken down into two methods: try_all_these_license_statements(lic_statements, url) and modify_record_to_include_license(record, license_info) [with different names of course]. Then in step 2 of your logic, you can just call try_all_these_license_statements which you will notice does not require a separate record object.
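
A rough sketch of that split, to make the shape concrete. Everything in it is an assumption rather than the actual Plugin code: the method names `match_license_statements` and `add_license_to_record`, the injected `fetch` callable (standing in for whatever HTTP call the plugin makes), the shape of `lic_statements` (a list of dicts mapping a statement string to license info), and the single-dict license layout used earlier in this thread.

```python
class Plugin(object):
    """Sketch of the base class with simple_extract split in two."""

    def __init__(self, fetch):
        # `fetch(url)` returns the page text; injected so the sketch
        # does not depend on any particular HTTP library.
        self.fetch = fetch

    def match_license_statements(self, lic_statements, url):
        """Return license info for the first statement found at url, else None."""
        text = self.fetch(url)
        for statement_mapping in lic_statements:
            for statement, license_info in statement_mapping.items():
                if statement in text:
                    return license_info
        return None

    def add_license_to_record(self, record, license_info):
        """Store the matched license on the record."""
        record.setdefault('bibjson', {})['license'] = license_info

    def simple_extract(self, lic_statements, record, url):
        """Match, then record: composed from the two pieces above."""
        license_info = self.match_license_statements(lic_statements, url)
        if license_info is not None:
            self.add_license_to_record(record, license_info)


class SpringerLinkPlugin(Plugin):
    def detect_license(self, lic_statements, record, url):
        # Step 1: the usual extraction on the landing page.
        self.simple_extract(lic_statements, record, url)
        current = record.get('bibjson', {}).get('license', {})
        if current.get('type') == 'publisher-asserted-accessible':
            # Step 2: match against the full text; no second record needed.
            better = self.match_license_statements(
                lic_statements, url + '/fulltext.html')
            if better is not None:
                self.add_license_to_record(record, better)
```

With this shape, a plugin that gets its text some other way (an API query, say) could still call the record-updating half on its own instead of duplicating it.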

The only real problem is that you currently can't use just the string-matching part of simple_extract on its own, which is what you need here. It's a simple matter of refactoring, one that also affects the elife plugin (currently the code that adds the license to the record is duplicated in Plugin.simple_extract and in the elife plugin).

I've known about this ever since I wrote elife but haven't had reason to do it beyond elife until now. This is one of the first things I'd try to do in phase 2's refactoring part.
