Add ProQuest integration #1520

Merged

Conversation

@vbessonov vbessonov commented Nov 8, 2020

Description

This PR adds a new descendant of OPDS2Importer: ProQuestOPDS2Importer. The new class makes it possible to import OPDS 2.0 feeds into the Circulation Manager and to download books using the ProQuest API.

Corresponding core changes are in PR #1208 and PR #1219.
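
At a high level, the new class subclasses the OPDS 2.0 importer and delegates book downloads to a ProQuest API client. A minimal sketch of that shape follows; the stand-in base class and the _fetch_book hook below are hypothetical, not the real core interfaces.

# Illustrative sketch only: OPDS2ImporterStandIn stands in for the real core
# OPDS2Importer, and the _fetch_book hook name is hypothetical.

class OPDS2ImporterStandIn(object):
    """Stand-in for the core OPDS 2.0 importer."""

    def import_from_feed(self, feed):
        raise NotImplementedError


class ProQuestOPDS2ImporterSketch(OPDS2ImporterStandIn):
    """Imports ProQuest OPDS 2.0 feeds and downloads books via the ProQuest API."""

    def __init__(self, api_client):
        self._api_client = api_client  # wraps the ProQuest Books API

    def _fetch_book(self, token, document_id):
        # Delegate the actual download to the ProQuest API client.
        return self._api_client.get_book(token, document_id)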

Motivation and Context

SIMPLY-3251

How Has This Been Tested?

Checklist:

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@vbessonov vbessonov force-pushed the feature/proquest-integration branch 2 times, most recently from a6b4730 to df4c097 on November 8, 2020 20:47
@vbessonov vbessonov closed this Nov 13, 2020
@vbessonov vbessonov reopened this Nov 13, 2020
@leonardr leonardr left a comment

It looks like this implementation will work, but I have a couple of comments, and there are two places where I'm concerned about redundant code and/or serious performance issues:

First, it looks like there's no way to stop a ProQuest import from retrieving every single book in the collection every time the script is run. Assuming that's not by design, working within OPDS2Importer._get_feeds should give you a framework for improving things.

Second, the implementation of CirculationAPI.patron_activity can probably be removed altogether, though it's possible I don't understand the rules regarding who is in charge of keeping track of the loans.

Finally, your new scripts need to be added to docker/services/simplified_crontab. How often they need to be run I don't know -- in particular, proquest_importer needs to be run infrequently (once a week at most) if it really does need to fetch the entire collection on every run.

Resolved review threads: bin/saml_monitor.py, bin/proquest_import_monitor.py, api/proquest/client.py, api/proquest/importer.py

    return feed

def download_all_feed_pages(self, db):
Contributor

This seems very similar to the logic in OPDSImportMonitor._get_feeds, which you override. I think if you made the current page an item of object state, you could override extract_next_links instead, so that it always returned a crafted URL that points to the next page.

I mention this not only because using the existing _get_feeds implementation would simplify code but because I believe overriding get_feeds prevents feed_contains_new_data from ever being run. Because of this I don't see how this monitor would ever stop before downloading the entire list of books from ProQuest.

If this is the only way to do it (e.g. because ProQuest's OPDS feeds don't include books in any particular order) then I still think using the existing _get_feeds implementation would be more elegant, but they will probably end up with the same behavior.
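
A minimal sketch of that suggestion, keeping the current page as object state; the hook names come from this discussion, but their exact signatures and the page/hitsPerPage query parameters are assumptions rather than the real ProQuest API:

class PagedProQuestMonitorSketch(object):
    """Keeps the current page as object state and hands the inherited
    pagination loop a crafted next-page URL each time it asks."""

    def __init__(self, base_url, page_size=100):
        self._base_url = base_url
        self._page_size = page_size
        self._current_page = 0  # object state, as suggested above

    def extract_next_links(self, feed):
        # Instead of a no-op, always return the URL that should be processed
        # next, so the inherited _get_feeds loop keeps driving the import
        # until the monitor decides to stop (e.g. via feed_contains_new_data).
        self._current_page += 1
        return [
            "{}?page={}&hitsPerPage={}".format(
                self._base_url, self._current_page, self._page_size
            )
        ]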

Contributor Author

The problem is that ProQuest's OPDS feed doesn't contain next links, so we have to iterate through it using the API.
Also, unfortunately, we have to download the whole feed every time, because only by doing that can we handle removals. There is no logic for that yet; I'll be adding it a bit later. It will basically consist of the following steps we discussed before (a small sketch follows the list):

  1. Create a set S1 of all current ProQuest IDs
  2. Walk through the whole feed and save all the IDs from the feed in a separate set S2
  3. Subtract S2 from S1; the resulting difference will contain the IDs that need to be removed
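
A minimal sketch of those three steps (the input collections are hypothetical: the first would come from the Circulation Manager database, the second from walking the full ProQuest feed):

def identifiers_to_remove(collection_identifiers, feed_identifiers):
    s1 = set(collection_identifiers)  # step 1: all ProQuest IDs currently in the collection
    s2 = set(feed_identifiers)        # step 2: all IDs seen while walking the feed
    return s1 - s2                    # step 3: IDs no longer present in the feed

# For example, identifiers_to_remove(["doc-1", "doc-2", "doc-3"], ["doc-1", "doc-3"])
# returns {"doc-2"}.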

Contributor

Even though the feed doesn't contain next links, you know how to construct the "link that should be processed next". I think you could do that work inside next_links, instead of making next_links a no-op and doing the work elsewhere.

Having to download the whole feed every time is unfortunate. How big is this feed likely to be? My concern is that a network error at any point in the process will invalidate the entire operation, so it may never actually complete.

Contributor Author

Unfortunately, it's not enough to just construct a URL and pass it as a next_link. ProQuest's API returns a JSON document containing an OPDS 2.0 feed as a field inside that document. ProQuestAPIClient contains special logic for parsing and validating the API's responses. That's why I would like to keep the feed-downloading logic there instead of extending OPDSImporter.
The feed can be quite large; at the moment it's around 38 MB.
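
To illustrate the kind of unwrapping involved (the envelope field names "status" and "opdsFeed" below are hypothetical; the real response shape is defined by ProQuest's API):

import json

def extract_opds_feed(response_body):
    # Sketch: the API response is a JSON envelope with the OPDS 2.0 feed
    # embedded as a field, so the client has to validate and unwrap it.
    document = json.loads(response_body)
    if document.get("status") not in (None, "OK"):  # hypothetical envelope field
        raise ValueError("ProQuest API returned an error: {}".format(document))
    feed = document.get("opdsFeed")  # hypothetical envelope field
    if feed is None:
        raise ValueError("Response does not contain an OPDS 2.0 feed")
    return feed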

Contributor

My advice, based on experience with other APIs, is that it's better to download a single document, even if it takes a really long time, than to try to download 100 smaller documents.

    return loan
except BaseError as exception:
    raise CannotLoan(six.ensure_text(str(exception)))

Contributor

Explicitly writing this down because I had to look it up: if the CM is in charge of keeping track of the loans there is no need to implement checkin as anything but a no-op. CirculationAPI.checkin will take care of deleting the local loan from the database.

Contributor Author

I'm not sure I'm following; this method is checkout and there is no checkin.

Contributor

There's no checkin implementation in this class, so you inherit BaseCirculationAPI.checkin, which is a no-op.

CirculationAPI manages the overall progress of a checkout or checkin operation. It delegates to a BaseCirculationAPI subclass (like ProQuestOPDS2Importer) when it's time to actually communicate with an external API. Everything apart from the external communication (like managing the loans and holds tables) is done inside CirculationAPI. This is why the BaseCirculationAPI subclasses create LoanInfo and HoldInfo objects--it's so you don't have to manage the database directly.

All of this is fine, I was just writing down the context.
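
A sketch of that division of labour, using stand-in classes (the real LoanInfo constructor and BaseCirculationAPI interface live in server_core and take more arguments than shown here):

import datetime

class LoanInfoSketch(object):
    """Stand-in for the core LoanInfo: just describes a loan."""

    def __init__(self, identifier, start_date, end_date):
        self.identifier = identifier
        self.start_date = start_date
        self.end_date = end_date


class ProQuestCirculationSketch(object):
    """Only talks to the external API; CirculationAPI owns the loans table."""

    def __init__(self, api_client):
        self._api_client = api_client

    def checkout(self, patron, pin, licensepool_identifier, internal_format=None):
        # Talk to ProQuest (omitted here), then describe the loan;
        # CirculationAPI is the one that persists it in the loans table.
        return LoanInfoSketch(
            identifier=licensepool_identifier,
            start_date=datetime.datetime.utcnow(),
            end_date=None,  # assumption: no hard end date in this sketch
        )

    # No checkin override: the inherited no-op is enough, because
    # CirculationAPI deletes the local loan row itself.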


def _get_feeds(self):
    for feed in self._client.download_all_feed_pages(self._db):
        yield None, feed
Contributor

As I mentioned earlier I think you can improve this code by working within the OPDS2ImportMonitor._get_feeds implementation rather than overriding it.

Resolved review threads: tests/proquest/test_identifier.py, tests/saml/fixtures.py
)

# 4. Assert that ProQuestCredentialManager.save_proquest_token
# was called when CM tried to save the token created in step 3.
Contributor

The narrative used in this test is easy to follow, thank you.


    self._logger = logging.getLogger(__name__)

def _parse_feed(self, feed, silent=True):
Contributor Author

It's a snippet from the OPDS2Importer class. I'll refactor it when I work on the code that allows setting up a custom classifier.

Contributor

Can you go into more detail about a custom classifier? I don't think import is the appropriate place to set up a custom classifier, because import only happens once and classification happens many times over the lifespan of a book, as the classification rules improve.

Contributor Author

Sorry, I missed this comment. Sure, you're absolutely right. I meant that I wanted to do this refactoring (removing the duplicated code) together with changing the default classifier used for the ProQuest feed to LCSH, because both of them require changes in server_core, so I'd rather wait and do them together.

Resolved review thread: api/proquest/client.py
)
)

return ProQuestBook(content=bytes(response.content))
@leonardr leonardr Nov 20, 2020

I really hope we can improve this later. Loading an entire book into memory is a problem, since there's no limit on the size and it could crash the server. Proxying an entire request is also a problem, because it blocks the thread from handling other incoming requests.

Using response.iter_content(1024*16) instead of response.content might be an easy way to solve the first problem.

Is there any way to give the client the information necessary to make this request? We have done similar things in the past by creating custom media types that just contain a URL and a bearer token.
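
A minimal sketch of the iter_content suggestion: stream the book in fixed-size chunks instead of holding the whole response body in memory (the URL, auth header, and destination path are placeholders):

import requests

def download_book_streaming(url, token, destination_path, chunk_size=1024 * 16):
    # stream=True keeps requests from reading the whole body into memory;
    # iter_content then yields it in chunk_size pieces.
    with requests.get(
        url, headers={"Authorization": "Bearer {}".format(token)}, stream=True
    ) as response:
        response.raise_for_status()
        with open(destination_path, "wb") as output:
            for chunk in response.iter_content(chunk_size):
                if chunk:  # skip keep-alive chunks
                    output.write(chunk)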


while True:
    try:
        book = self._api_client.get_book(self._db, token, document_id)
Contributor

When I say 'can the client make the request', this is the request I'm talking about. Can we send the token to the end-user and tell them to make this request, or does that pose a security problem?

Contributor Author

The client can't make the request, unfortunately, because the ProQuest API is available only to specific whitelisted IP addresses.

@leonardr
Contributor

Overall this looks good now. There are some issues, but we can fix them later, since Lyrasis is the only party that will run into them to start with.

I'm not going to mark this 'ready to merge' yet, for two reasons:

  1. The test suite seems to be hung or not running, so I don't know if the tests pass. Resolving the conflict with the core submodule might un-stick it.
  2. I want to make sure we at least have a plan and a separate ticket for making the fetch of the actual book the responsibility of the (web or mobile) client. Proxying an ACSM file is fine, because those are tiny; proxying an entire book is a big problem.

Since I'm much further away from approving #1523, you'll need to build a custom image to do a demo that involves ProQuest and SAML. I'll take a look at this on Monday with an eye towards merging, but since you need to build a custom image anyway, hopefully it won't upset your plans if this branch, like #1523, isn't merged for a while.

@vbessonov
Contributor Author

@leonardr, I created a new PR, #1219, in server_core, which is required for this branch.

:return: Identifier object
:rtype: Identifier
"""
return parse_identifier(self._db, identifier)
Contributor

Is this still being used? Since it's a private method I think you can remove it after changing to parse_identifier.

Contributor

Never mind, you use it in the core branch.

@leonardr
Contributor

I'm assuming you got the short-circuit to work by changing the way identifiers are parsed so that OPDSImporter.feed_contains_new_data gives the right result. Is that right?

@vbessonov
Contributor Author

@leonardr, yes, the short-circuit is supposed to work, and I added tests verifying that it works correctly. However, the problem is that ProQuest doesn't sort the feed by modified date. They can't do that because it wouldn't make sense for items bought by the university: those items have "old" publication and modified dates and would be considered already processed by the Circulation Manager, which isn't true. It seems that we have to run the import process from scratch every time, but since it takes a long time (around 35-40 hours) there are serious concerns about whether it's possible to import the whole feed in one run. And if the run fails, we have to start from scratch again.

@leonardr
Contributor

Having re-familiarized myself with #1520 after a break, I see a few outstanding problems:

  1. From our perspective, the ProQuest feed is in no particular order
  2. This means the short-circuit you added recently will make the CM stop processing when it ought to keep going. feed_contains_new_data now works correctly, but the underlying assumption about how to use that information has been violated. It's possible for feed_contains_new_data to return False for page N but True for page N+1.
  3. Without the short-circuit, the CM will do the right thing (since feed_contains_new_data will always return True) but it will take an unacceptably long time.
  4. When a patron requests a book, the entire book is streamed through the CM, rather than the CM sending them a link they can use to download the book.

We can't do anything about the first issue.

It's OK to punt the fourth issue until later; we just need to file a separate ticket for it.

The second and third issues are related. If I'm right, the short-circuit code you added should be removed, since it causes the CM to do the wrong thing, rather than to do the right thing too slowly.

At that point we can file a separate ticket for "Proquest import is way too slow" and figure out how to proceed from there.
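
A toy illustration of issue 2 (hypothetical data; feed_contains_new_data is simulated by the has_new_data flag): with an unordered feed, stopping at the first page that contains nothing new can skip later pages that do.

def pages_processed_with_short_circuit(pages):
    # pages is a list of dicts like {"page": 1, "has_new_data": False}.
    processed = []
    for page in pages:
        if not page["has_new_data"]:  # feed_contains_new_data returned False
            break  # short-circuit: stop paginating here
        processed.append(page["page"])
    return processed

# A newly purchased "old" book may live on page 2 while page 1 looks
# entirely familiar:
# pages_processed_with_short_circuit([
#     {"page": 1, "has_new_data": False},
#     {"page": 2, "has_new_data": True},
# ]) returns [] -- page 2 is never reached.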

@leonardr
Contributor

A couple of questions regarding the ProQuest feed:

  • Is the feed in any kind of order? Can we assume that newly purchased and updated items are at the front of the feed, even if the "modified" dates appear to be very old? Or is the feed in no order at all?
  • Is there any information in an entry that we can use to check whether we've processed that entry before? Or do we have to go through the whole process of building a Metadata object and calling Metadata.apply?

I'm guessing most of the 30-40 hour runtime of this process is Metadata.apply calls that ultimately do nothing, and I'm a little pessimistic about being able to optimize that. However, we can improve reliability by downloading every page of the ProQuest feed before processing it, and by deciding to process a given work only once every few days.
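
A rough sketch of the "download everything first, process afterwards" idea: spool each feed page to local disk, so a network error during processing doesn't force re-downloading a 35-40 hour run. Here fetch_page is a hypothetical callable standing in for the ProQuest client; it would return one page as a dict, or None when there are no more pages.

import json
import os
import tempfile

def spool_feed_pages(fetch_page, spool_dir=None):
    # Write each page to its own file; the importer can then process the
    # spooled pages without touching the network again.
    spool_dir = spool_dir or tempfile.mkdtemp(prefix="proquest-feed-")
    page_number = 1
    while True:
        page = fetch_page(page_number)
        if page is None:
            break
        path = os.path.join(spool_dir, "page-{:06d}.json".format(page_number))
        with open(path, "w") as output:
            json.dump(page, output)
        page_number += 1
    return spool_dir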

@vbessonov
Contributor Author

@leonardr, you're right about the short-circuit: it won't work correctly in the current situation, where the feed is not ordered. I was thinking about using the --force flag, but that doesn't make sense because it wouldn't work correctly with the default parameters. I'll comment out the short-circuit logic.

Regarding the need to stream open-access books, I added SIMPLY-3343. ProQuest might change their API and drop IP whitelisting for the DownloadLink service used for downloading books. That would allow us to move the download part to the client applications.

Regarding your last two questions:

  • The feed is currently ordered by publication date, but I'm not sure where new items will show up. For example, new items can actually be "old" books (with a publication date somewhere in the past) that were recently bought by the library, and they'll most likely show up somewhere in the middle of the feed.
  • Currently, there is no information that would let us tell that we've already processed an item. I discussed this issue with ProQuest, but they don't want to invest more development resources into fixing it, and they have more or less agreed that we can import the feed twice a week.

@leonardr
Contributor

leonardr commented Dec 2, 2020

If ProQuest is only accepting connections from certain IPs then I agree there's no alternative to the CM proxying the entire book. However it should be possible for the CM to stream the book to the client rather than reading the whole thing into memory. Most academic PDFs are small, but an art history book could be over 100 megabytes. I filed https://jira.nypl.org/browse/SIMPLY-3344 to cover this.
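
A sketch of streaming the proxied book onward to the patron rather than buffering it, assuming a Flask-style view; the URL, auth header, and fallback content type are placeholders:

import requests
from flask import Response, stream_with_context

def proxy_book(url, token, chunk_size=1024 * 16):
    # Meant to be returned from inside a Flask view: the upstream response is
    # forwarded to the patron chunk by chunk instead of being read into memory.
    upstream = requests.get(
        url, headers={"Authorization": "Bearer {}".format(token)}, stream=True
    )
    upstream.raise_for_status()
    return Response(
        stream_with_context(upstream.iter_content(chunk_size)),
        content_type=upstream.headers.get("Content-Type", "application/pdf"),
    )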

@leonardr
Contributor

leonardr commented Dec 2, 2020

Remove the short-circuit and I think we can merge this; the remaining issues have been moved into separate tickets.

@vbessonov
Contributor Author

@leonardr, thank you. I commented out the short-circuit and added a comment mentioning SIMPLY-3343.

@leonardr leonardr merged commit b1f0f94 into NYPL-Simplified:develop Dec 9, 2020