Add ProQuest integration #1520
Conversation
(force-pushed from a6b4730 to df4c097)
(force-pushed from df4c097 to ed06095)
(force-pushed from ee44625 to 59a5ba5)
(force-pushed from 59a5ba5 to 3934a1a)
It looks like this implementation will work, but I have a couple of comments, and there are two places where I'm concerned about redundant code and/or serious performance issues:
First, it looks like there's no way to stop a ProQuest import from retrieving every single book in the collection every time the script is run. Assuming that's not by design, working within OPDS2Importer._get_feeds
should give you a framework for improving things.
Second, the implementation of CirculationAPI.patron_activity can probably be removed altogether, though it's possible I don't understand the rules regarding who is in charge of keeping track of the loans.
Finally, your new scripts need to be added to docker/services/simplified_crontab. I don't know how often they need to be run -- in particular, proquest_importer needs to be run infrequently (once a week at most) if it really does need to fetch the entire collection on every run.
        return feed

    def download_all_feed_pages(self, db):
This seems very similar to the logic in OPDSImportMonitor._get_feeds, which you override. I think if you made the current page an item of object state, you could override extract_next_links instead, so that it always returned a crafted URL that points to the next page.
I mention this not only because using the existing _get_feeds implementation would simplify the code, but because I believe overriding _get_feeds prevents feed_contains_new_data from ever being run. Because of this, I don't see how this monitor would ever stop before downloading the entire list of books from ProQuest.
If this is the only way to do it (e.g. because ProQuest's OPDS feeds don't include books in any particular order), then I still think using the existing _get_feeds implementation would be more elegant, but they will probably end up with the same behavior.
The problem is that ProQuest's OPDS feed doesn't contain next links, and we have to iterate through it using the API.
Also, unfortunately, we have to download the whole feed every time, because only by doing this will we be able to handle removals. There is no logic for it yet; I'll be adding it a bit later. It will basically consist of the following steps we discussed before (see the sketch after this list):
- Create a set S1 of all current ProQuest IDs
- Walk through the whole feed and save all the IDs from the feed in a separate set S2
- Subtract S2 from S1; the resulting difference will contain the IDs which need to be removed
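A minimal sketch of those three steps, assuming hypothetical helpers (get_current_proquest_ids and iterate_feed_identifiers are placeholders, not real ProQuestAPIClient methods):

def find_removed_identifiers(db, api_client):
    # S1: all ProQuest identifiers the CM currently knows about.
    s1 = set(get_current_proquest_ids(db))
    # S2: every identifier seen while walking the whole feed.
    s2 = set(iterate_feed_identifiers(api_client))
    # S1 - S2: identifiers gone from the feed, i.e. the ones to remove.
    return s1 - s2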
Even though the feed doesn't contain next links, you know how to construct the "link that should be processed next". I think you could do that work inside next_links, instead of making next_links a no-op and doing the work elsewhere.
Having to download the whole feed every time is unfortunate. How big is this feed likely to be? My concern is that a network error at any point in the process will invalidate the entire operation, so it may never actually complete.
Unfortunately, it's not enough to just construct a URL and pass it as a next_link. ProQuest's API returns a JSON document containing an OPDS 2.0 feed as a field inside this document. ProQuestAPIClient contains special logic for parsing the API's responses and validating them. That's why I would like to keep the feed-downloading logic there instead of extending OPDSImporter.
The feed can be quite large; at the moment it's around 38 MB.
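For illustration, here is roughly what "an OPDS 2.0 feed as a field inside a JSON document" implies for the client; the envelope field name "opdsFeed" is an assumption, not ProQuest's documented schema:

import json

def extract_feed_page(response_text):
    document = json.loads(response_text)
    # ProQuestAPIClient would validate the envelope here before handing
    # the embedded feed to the importer.
    return document["opdsFeed"]  # hypothetical field name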
My advice, based on experience with other APIs, is that it's better to download a single document, even if it takes a really long time, than to try to download 100 smaller documents.
            return loan
        except BaseError as exception:
            raise CannotLoan(six.ensure_text(str(exception)))
Explicitly writing this down because I had to look it up: if the CM is in charge of keeping track of the loans, there is no need to implement checkin as anything but a no-op. CirculationAPI.checkin will take care of deleting the local loan from the database.
I'm not sure I'm following; this method is checkout, and there is no checkin.
There's no checkin implementation in this class, so you inherit BaseCirculationAPI.checkin, which is a no-op.
CirculationAPI manages the overall progress of a checkout or checkin operation. It delegates to a BaseCirculationAPI subclass (like ProquestOPDS2Importer) when it's time to actually communicate with an external API. Everything apart from the external communication (like managing the loans and holds tables) is done inside CirculationAPI. This is why the BaseCirculationAPI subclasses create LoanInfo and HoldInfo objects -- it's so you don't have to manage the database directly.
All of this is fine, I was just writing down the context.
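A rough sketch of that division of labor, under the assumptions that the LoanInfo constructor follows its usual argument order and that _api_client.create_loan exists (it is a placeholder here, not a real client method):

from api.circulation import BaseCirculationAPI, LoanInfo

class SketchedProQuestAPI(BaseCirculationAPI):
    def checkout(self, patron, pin, licensepool, internal_format):
        # External communication only -- CirculationAPI, not this class,
        # writes the resulting loan to the database.
        loan = self._api_client.create_loan(patron, licensepool.identifier)
        return LoanInfo(
            licensepool.collection,
            licensepool.data_source.name,
            licensepool.identifier.type,
            licensepool.identifier.identifier,
            start_date=loan.start_date,
            end_date=loan.end_date,
            fulfillment_info=None,
        )

    # No checkin override: the inherited no-op suffices, because
    # CirculationAPI.checkin deletes the local loan row itself.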
    def _get_feeds(self):
        for feed in self._client.download_all_feed_pages(self._db):
            yield None, feed
As I mentioned earlier, I think you can improve this code by working within the OPDS2ImportMonitor._get_feeds implementation rather than overriding it.
        )

        # 4. Assert that ProQuestCredentialManager.save_proquest_token
        # was called when CM tried to save the token created in step 3.
The narrative used in this test is easy to follow, thank you.
(force-pushed from d3e62c4 to 61e8836)
…irculation into feature/proquest-integration
api/proquest/importer.py (Outdated)
        self._logger = logging.getLogger(__name__)

    def _parse_feed(self, feed, silent=True):
It's a snippet from the OPDS2Importer class. I'll refactor it when I work on the code that allows setting up a custom classifier.
Can you go into more detail about a custom classifier? I don't think import is the appropriate place to set up a custom classifier, because import only happens once and classification happens many times over the lifespan of a book, as the classification rules improve.
Sorry, I missed this comment. Sure, you're absolutely right. I meant that I wanted to do this refactoring (removing duplicated code) together with changing the default classifier used for the ProQuest feed to LCSH, because both of them require changes in server_core, and I wouldn't want to wait that long.
            )
        )

        return ProQuestBook(content=bytes(response.content))
I really hope we can improve this later. Loading an entire book into memory is a problem, since there's no limit on the size and it could crash the server. Proxying an entire request is also a problem, because it blocks the thread from handling other incoming requests.
Using response.iter_content(1024*16) instead of response.content might be an easy way to solve the first problem.
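As a sketch of that first fix, streaming in 16 KB chunks keeps the book out of memory; whether ProQuestBook can accept a generator instead of bytes is an open assumption:

import requests

def stream_book(url, headers=None):
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    # Yield fixed-size chunks so the whole book never sits in memory at once.
    for chunk in response.iter_content(chunk_size=1024 * 16):
        if chunk:  # skip keep-alive chunks
            yield chunk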
Is there any way to give the client the information necessary to make this request? We have done similar things in the past by creating custom media types that just contain a URL and a bearer token.
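For reference, the "custom media type" idea could boil down to handing the client a small document like the one below; every field here is a guess modeled on past bearer-token approaches, not a spec:

fulfillment_document = {
    "token_type": "Bearer",
    "access_token": "<short-lived ProQuest token>",  # hypothetical token
    "expires_in": 3600,
    "location": "https://example.com/proquest/books/12345",  # hypothetical URL
}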
        while True:
            try:
                book = self._api_client.get_book(self._db, token, document_id)
When I say 'can the client make the request', this is the request I'm talking about. Can we send token to the end-user and tell them to make this request, or does that pose a security problem?
The client can't make the request, unfortunately, because the ProQuest API is available only to specific whitelisted IP addresses.
Overall this looks good now. There are some issues, but we can fix them later, since to start with Lyrasis is the only party that will run into them. I'm not going to mark this 'ready to merge' yet for two reasons:
Since I'm much further away from approving #1523, you'll need to build a custom image to do a demo that involves ProQuest and SAML. I'll take a look at this on Monday with an eye towards merging, but since you need to build a custom image anyway, hopefully it won't upset your plans if this branch, like #1523, isn't merged for a while.
@leonardr, I created a new PR #1219 in server_core which is required for this branch.
        :return: Identifier object
        :rtype: Identifier
        """
        return parse_identifier(self._db, identifier)
Is this still being used? Since it's a private method, I think you can remove it after changing to parse_identifier.
Never mind, you use it in the core branch.
I'm assuming you got the short-circuit to work by changing the way identifiers are parsed so that OPDSImporter.feed_contains_new_data gives the right result. Is that right?
@leonardr, yes, the short-circuit is supposed to work, and I added tests verifying that it works correctly. However, the problem is that ProQuest doesn't sort the feed by
Having re-familiarized myself with #1520 after a break, I see a few outstanding problems:
We can't do anything about the first issue. It's OK to punt the fourth issue until later; we just need to file a separate ticket for it. The second and third issues are related. If I'm right, the short-circuit code you added should be removed, since it causes the CM to do the wrong thing, rather than to do the right thing too slowly. At that point we can file a separate ticket for "ProQuest import is way too slow" and figure out how to proceed from there.
A couple of questions regarding the ProQuest feed:
I'm guessing most of the 30-40 hour runtime of this process is
@leonardr, you're right about the short-circuit: it won't work correctly in the current situation, where the feed is not ordered. I was thinking about using
Regarding the need to stream open-access books, I added SIMPLY-3343. ProQuest might change their API and drop IP whitelisting for
Regarding your last two questions:
If ProQuest is only accepting connections from certain IPs, then I agree there's no alternative to the CM proxying the entire book. However, it should be possible for the CM to stream the book to the client rather than reading the whole thing into memory. Most academic PDFs are small, but an art history book could be over 100 megabytes. I filed https://jira.nypl.org/browse/SIMPLY-3344 to cover this.
Remove the short-circuit and I think we can merge this; the remaining issues have been moved into separate tickets.
@leonardr, thank you. I commented out the short-circuit and added a comment mentioning SIMPLY-3343.
Description
This PR adds a new descendant of OPDS2Importer: ProQuestOPDS2Importer. The new class makes it possible to import OPDS 2.0 feeds into the Circulation Manager and to download books using the ProQuest API. Corresponding core changes are in PR #1208 and PR #1219.
Motivation and Context
SIMPLY-3251
How Has This Been Tested?
Checklist: