Add ProQuest integration #1520

Merged

Conversation

@vbessonov vbessonov commented Nov 8, 2020

Description

This PR adds a new descendant of OPDS2Importer: ProQuestOPDS2Importer. The new class makes it possible to import OPDS 2.0 feeds into the Circulation Manager and to download books using the ProQuest API.

Corresponding core changes are in PR #1208 and PR #1219.
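
At a high level, the new class subclasses the OPDS 2.0 importer and delegates book downloads to a ProQuest API client. A minimal sketch of that shape follows; the stand-in base class and the _fetch_book hook below are hypothetical, not the real core interfaces.

# Illustrative sketch only: OPDS2ImporterStandIn stands in for the real core
# OPDS2Importer, and the _fetch_book hook name is hypothetical.

class OPDS2ImporterStandIn(object):
    """Stand-in for the core OPDS 2.0 importer."""

    def import_from_feed(self, feed):
        raise NotImplementedError


class ProQuestOPDS2ImporterSketch(OPDS2ImporterStandIn):
    """Imports ProQuest OPDS 2.0 feeds and downloads books via the ProQuest API."""

    def __init__(self, api_client):
        self._api_client = api_client  # wraps the ProQuest Books API

    def _fetch_book(self, token, document_id):
        # Delegate the actual download to the ProQuest API client.
        return self._api_client.get_book(token, document_id)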

Motivation and Context

SIMPLY-3251

How Has This Been Tested?

Checklist:

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@vbessonov vbessonov force-pushed the feature/proquest-integration branch 2 times, most recently from a6b4730 to df4c097 on November 8, 2020 20:47
@vbessonov vbessonov closed this Nov 13, 2020
@vbessonov vbessonov reopened this Nov 13, 2020
@leonardr leonardr left a comment

It looks like this implementation will work, but I have a couple of comments, and there are two places where I'm concerned about redundant code and/or serious performance issues:

First, it looks like there's no way to stop a ProQuest import from retrieving every single book in the collection every time the script is run. Assuming that's not by design, working within OPDS2Importer._get_feeds should give you a framework for improving things.

Second, the implementation of CirculationAPI.patron_activity can probably be removed altogether, though it's possible I don't understand the rules regarding who is in charge of keeping track of the loans.

Finally, your new scripts need to be added to docker/services/simplified_crontab. How often they need to be run I don't know -- in particular, proquest_importer needs to be run infrequently (once a week at most) if it really does need to fetch the entire collection on every run.

Resolved review threads: bin/saml_monitor.py, bin/proquest_import_monitor.py, api/proquest/client.py, api/proquest/importer.py

    return feed

def download_all_feed_pages(self, db):
Contributor

This seems very similar to the logic in OPDSImportMonitor._get_feeds, which you override. I think if you made the current page an item of object state, you could override extract_next_links instead, so that it always returned a crafted URL that points to the next page.

I mention this not only because using the existing _get_feeds implementation would simplify code but because I believe overriding get_feeds prevents feed_contains_new_data from ever being run. Because of this I don't see how this monitor would ever stop before downloading the entire list of books from ProQuest.

If this is the only way to do it (e.g. because ProQuest's OPDS feeds don't include books in any particular order) then I still think using the existing _get_feeds implementation would be more elegant, but they will probably end up with the same behavior.
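
A minimal sketch of that suggestion, keeping the current page as object state; the hook names come from this discussion, but their exact signatures and the page/hitsPerPage query parameters are assumptions rather than the real ProQuest API:

class PagedProQuestMonitorSketch(object):
    """Keeps the current page as object state and hands the inherited
    pagination loop a crafted next-page URL each time it asks."""

    def __init__(self, base_url, page_size=100):
        self._base_url = base_url
        self._page_size = page_size
        self._current_page = 0  # object state, as suggested above

    def extract_next_links(self, feed):
        # Instead of a no-op, always return the URL that should be processed
        # next, so the inherited _get_feeds loop keeps driving the import
        # until the monitor decides to stop (e.g. via feed_contains_new_data).
        self._current_page += 1
        return [
            "{}?page={}&hitsPerPage={}".format(
                self._base_url, self._current_page, self._page_size
            )
        ]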

Contributor Author

The problem is that ProQuest's OPDS feed doesn't contain next links, so we have to iterate through it using the API.
Also, unfortunately, we have to download the whole feed every time, because only by doing that can we handle removals. There is no logic for that yet; I'll be adding it a bit later. It will basically consist of the following steps we discussed before (a small sketch follows the list):

  1. Create a set S1 of all current ProQuest IDs
  2. Walk through the whole feed and save all the IDs from the feed in a separate set S2
  3. Subtract S2 from S1; the resulting difference will contain the IDs that need to be removed
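
A minimal sketch of those three steps (the input collections are hypothetical: the first would come from the Circulation Manager database, the second from walking the full ProQuest feed):

def identifiers_to_remove(collection_identifiers, feed_identifiers):
    s1 = set(collection_identifiers)  # step 1: all ProQuest IDs currently in the collection
    s2 = set(feed_identifiers)        # step 2: all IDs seen while walking the feed
    return s1 - s2                    # step 3: IDs no longer present in the feed

# For example, identifiers_to_remove(["doc-1", "doc-2", "doc-3"], ["doc-1", "doc-3"])
# returns {"doc-2"}.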

Contributor

Even though the feed doesn't contain next links, you know how to construct the "link that should be processed next". I think you could do that work inside next_links, instead of making next_links a no-op and doing the work elsewhere.

Having to download the whole feed every time is unfortunate. How big is this feed likely to be? My concern is that a network error at any point in the process will invalidate the entire operation, so it may never actually complete.

Contributor Author

Unfortunately, it's not enough to just construct a URL and pass it as a next_link. ProQuest's API returns a JSON document containing an OPDS 2.0 feed as a field inside that document. ProQuestAPIClient contains special logic for parsing and validating the API's responses. That's why I would like to keep the feed-downloading logic there instead of extending OPDSImporter.
The feed can be quite large; at the moment it's around 38 MB.
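
To illustrate the kind of unwrapping involved (the envelope field names "status" and "opdsFeed" below are hypothetical; the real response shape is defined by ProQuest's API):

import json

def extract_opds_feed(response_body):
    # Sketch: the API response is a JSON envelope with the OPDS 2.0 feed
    # embedded as a field, so the client has to validate and unwrap it.
    document = json.loads(response_body)
    if document.get("status") not in (None, "OK"):  # hypothetical envelope field
        raise ValueError("ProQuest API returned an error: {}".format(document))
    feed = document.get("opdsFeed")  # hypothetical envelope field
    if feed is None:
        raise ValueError("Response does not contain an OPDS 2.0 feed")
    return feed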

Contributor

My advice, based on experience with other APIs, is that it's better to download a single document, even if it takes a really long time, than to try to download 100 smaller documents.

    return loan
except BaseError as exception:
    raise CannotLoan(six.ensure_text(str(exception)))

Contributor

Explicitly writing this down because I had to look it up: if the CM is in charge of keeping track of the loans there is no need to implement checkin as anything but a no-op. CirculationAPI.checkin will take care of deleting the local loan from the database.

Contributor Author

I'm not sure I'm following; this method is checkout and there is no checkin.

Contributor

There's no checkin implementation in this class, so you inherit BaseCirculationAPI.checkin, which is a no-op.

CirculationAPI manages the overall progress of a checkout or checkin operation. It delegates to a BaseCirculationAPI subclass (like ProQuestOPDS2Importer) when it's time to actually communicate with an external API. Everything apart from the external communication (like managing the loans and holds tables) is done inside CirculationAPI. This is why the BaseCirculationAPI subclasses create LoanInfo and HoldInfo objects--it's so you don't have to manage the database directly.

All of this is fine, I was just writing down the context.
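
A sketch of that division of labour, using stand-in classes (the real LoanInfo constructor and BaseCirculationAPI interface live in server_core and take more arguments than shown here):

import datetime

class LoanInfoSketch(object):
    """Stand-in for the core LoanInfo: just describes a loan."""

    def __init__(self, identifier, start_date, end_date):
        self.identifier = identifier
        self.start_date = start_date
        self.end_date = end_date


class ProQuestCirculationSketch(object):
    """Only talks to the external API; CirculationAPI owns the loans table."""

    def __init__(self, api_client):
        self._api_client = api_client

    def checkout(self, patron, pin, licensepool_identifier, internal_format=None):
        # Talk to ProQuest (omitted here), then describe the loan;
        # CirculationAPI is the one that persists it in the loans table.
        return LoanInfoSketch(
            identifier=licensepool_identifier,
            start_date=datetime.datetime.utcnow(),
            end_date=None,  # assumption: no hard end date in this sketch
        )

    # No checkin override: the inherited no-op is enough, because
    # CirculationAPI deletes the local loan row itself.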


def _get_feeds(self):
    for feed in self._client.download_all_feed_pages(self._db):
        yield None, feed
Contributor

As I mentioned earlier I think you can improve this code by working within the OPDS2ImportMonitor._get_feeds implementation rather than overriding it.

Resolved review threads: tests/proquest/test_identifier.py, tests/saml/fixtures.py
)

# 4. Assert that ProQuestCredentialManager.save_proquest_token
# was called when CM tried to save the token created in step 3.
Contributor

The narrative used in this test is easy to follow, thank you.


    self._logger = logging.getLogger(__name__)

def _parse_feed(self, feed, silent=True):
Contributor Author

It's a snippet from the OPDS2Importer class. I'll refactor it when I work on the code that allows setting up a custom classifier.

Contributor

Can you go into more detail about a custom classifier? I don't think import is the appropriate place to set up a custom classifier, because import only happens once and classification happens many times over the lifespan of a book, as the classification rules improve.

Contributor Author

Sorry, I missed this comment. Sure, you're absolutely right. I meant that I wanted to do this refactoring (removing the duplicated code) together with changing the default classifier used for the ProQuest feed to LCSH, because both of them require changes in server_core, so I'd rather wait and do them together.

Resolved review thread: api/proquest/client.py
)
)

return ProQuestBook(content=bytes(response.content))
@leonardr leonardr Nov 20, 2020

I really hope we can improve this later. Loading an entire book into memory is a problem, since there's no limit on the size and it could crash the server. Proxying an entire request is also a problem, because it blocks the thread from handling other incoming requests.

Using response.iter_content(1024*16) instead of response.content might be an easy way to solve the first problem.

Is there any way to give the client the information necessary to make this request? We have done similar things in the past by creating custom media types that just contain a URL and a bearer token.
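
A minimal sketch of the iter_content suggestion: stream the book in fixed-size chunks instead of holding the whole response body in memory (the URL, auth header, and destination path are placeholders):

import requests

def download_book_streaming(url, token, destination_path, chunk_size=1024 * 16):
    # stream=True keeps requests from reading the whole body into memory;
    # iter_content then yields it in chunk_size pieces.
    with requests.get(
        url, headers={"Authorization": "Bearer {}".format(token)}, stream=True
    ) as response:
        response.raise_for_status()
        with open(destination_path, "wb") as output:
            for chunk in response.iter_content(chunk_size):
                if chunk:  # skip keep-alive chunks
                    output.write(chunk)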


while True:
    try:
        book = self._api_client.get_book(self._db, token, document_id)
Contributor

When I say 'can the client make the request', this is the request I'm talking about. Can we send the token to the end-user and tell them to make this request, or does that pose a security problem?

Contributor Author

The client can't make the request, unfortunately, because the ProQuest API is available only to specific whitelisted IP addresses.

@leonardr
Contributor

Overall this looks good now. There are some issues, but we can fix them later, since Lyrasis is the only party that will run into them to start with.

I'm not going to mark this 'ready to merge' yet, for two reasons:

  1. The test suite seems to be hung or not running, so I don't know if the tests pass. Resolving the conflict with the core submodule might un-stick it.
  2. I want to make sure we at least have a plan and a separate ticket for making the fetch of the actual book the responsibility of the (web or mobile) client. Proxying an ACSM file is fine, because those are tiny; proxying an entire book is a big problem.

Since I'm much further away from approving #1523, you'll need to build a custom image to do a demo that involves ProQuest and SAML. I'll take a look at this on Monday with an eye towards merging, but since you need to build a custom image anyway, hopefully it won't upset your plans if this branch, like #1523, isn't merged for a while.

@vbessonov
Contributor Author

@leonardr, I created a new PR, #1219, in server_core, which is required for this branch.

:return: Identifier object
:rtype: Identifier
"""
return parse_identifier(self._db, identifier)
Contributor

Is this still being used? Since it's a private method I think you can remove it after changing to parse_identifier.

Contributor

Never mind, you use it in the core branch.

@leonardr
Contributor

I'm assuming you got the short-circuit to work by changing the way identifiers are parsed so that OPDSImporter.feed_contains_new_data gives the right result. Is that right?

@vbessonov
Contributor Author

@leonardr, yes, the short-circuit is supposed to work, and I added tests verifying that it works correctly. However, the problem is that ProQuest doesn't sort the feed by modified date. They can't do that because it wouldn't make sense for items bought by the university: those items have "old" publication and modified dates and would be considered already processed by the Circulation Manager, which isn't true. It seems that we have to run the import process from scratch every time, but since it takes a long time (around 35-40 hours) there are serious concerns about whether it's possible to import the whole feed in one run. And if the run fails, we have to start from scratch again.

@leonardr
Contributor

Having re-familiarized myself with #1520 after a break, I see a few outstanding problems:

  1. From our perspective, the ProQuest feed is in no particular order
  2. This means the short-circuit you added recently will make the CM stop processing when it ought to keep going. feed_contains_new_data now works correctly, but the underlying assumption about how to use that information has been violated. It's possible for feed_contains_new_data to return False for page N but True for page N+1.
  3. Without the short-circuit, the CM will do the right thing (since feed_contains_new_data will always return True) but it will take an unacceptably long time.
  4. When a patron requests a book, the entire book is streamed through the CM, rather than the CM sending them a link they can use to download the book.

We can't do anything about the first issue.

It's OK to punt the fourth issue until later; we just need to file a separate ticket for it.

The second and third issues are related. If I'm right, the short-circuit code you added should be removed, since it causes the CM to do the wrong thing, rather than to do the right thing too slowly.

At that point we can file a separate ticket for "Proquest import is way too slow" and figure out how to proceed from there.
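
A toy illustration of issue 2 (hypothetical data; feed_contains_new_data is simulated by the has_new_data flag): with an unordered feed, stopping at the first page that contains nothing new can skip later pages that do.

def pages_processed_with_short_circuit(pages):
    # pages is a list of dicts like {"page": 1, "has_new_data": False}.
    processed = []
    for page in pages:
        if not page["has_new_data"]:  # feed_contains_new_data returned False
            break  # short-circuit: stop paginating here
        processed.append(page["page"])
    return processed

# A newly purchased "old" book may live on page 2 while page 1 looks
# entirely familiar:
# pages_processed_with_short_circuit([
#     {"page": 1, "has_new_data": False},
#     {"page": 2, "has_new_data": True},
# ]) returns [] -- page 2 is never reached.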

@leonardr
Contributor

A couple of questions regarding the ProQuest feed:

  • Is the feed in any kind of order? Can we assume that newly purchased and updated items are at the front of the feed, even if the "modified" dates appear to be very old? Or is the feed in no order at all?
  • Is there any information in an entry that we can use to check whether we've processed that entry before? Or do we have to go through the whole process of building a Metadata object and calling Metadata.apply?

I'm guessing most of the 30-40 hour runtime of this process is Metadata.apply calls that ultimately do nothing, and I'm a little pessimistic about being able to optimize that. However, we can improve reliability by downloading every page of the ProQuest feed before processing it, and by deciding to process a given work only once every few days.
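
A rough sketch of the "download everything first, process afterwards" idea: spool each feed page to local disk, so a network error during processing doesn't force re-downloading a 35-40 hour run. Here fetch_page is a hypothetical callable standing in for the ProQuest client; it would return one page as a dict, or None when there are no more pages.

import json
import os
import tempfile

def spool_feed_pages(fetch_page, spool_dir=None):
    # Write each page to its own file; the importer can then process the
    # spooled pages without touching the network again.
    spool_dir = spool_dir or tempfile.mkdtemp(prefix="proquest-feed-")
    page_number = 1
    while True:
        page = fetch_page(page_number)
        if page is None:
            break
        path = os.path.join(spool_dir, "page-{:06d}.json".format(page_number))
        with open(path, "w") as output:
            json.dump(page, output)
        page_number += 1
    return spool_dir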

@vbessonov
Contributor Author

@leonardr, you're right about the short-circuit: it won't work correctly in the current situation, where the feed is not ordered. I was thinking about using the --force flag, but that doesn't make sense because it wouldn't work correctly with the default parameters. I'll comment out the short-circuit logic.

Regarding the need to stream open-access books, I added SIMPLY-3343. ProQuest might change their API and drop IP whitelisting for the DownloadLink service used for downloading books. That would allow us to move the download part to the client applications.

Regarding your last two questions:

  • The feed is currently ordered by publication date, but I'm not sure where new items will show up. For example, new items can actually be "old" books (with a publication date somewhere in the past) that were recently bought by the library, and they'll most likely show up somewhere in the middle of the feed.
  • Currently, there is no information that would let us tell that we've already processed an item. I discussed this issue with ProQuest, but they don't want to invest more development resources into fixing it, and they have more or less agreed that we can import the feed twice a week.

@leonardr
Contributor

leonardr commented Dec 2, 2020

If ProQuest is only accepting connections from certain IPs then I agree there's no alternative to the CM proxying the entire book. However it should be possible for the CM to stream the book to the client rather than reading the whole thing into memory. Most academic PDFs are small, but an art history book could be over 100 megabytes. I filed https://jira.nypl.org/browse/SIMPLY-3344 to cover this.
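
A sketch of streaming the proxied book onward to the patron rather than buffering it, assuming a Flask-style view; the URL, auth header, and fallback content type are placeholders:

import requests
from flask import Response, stream_with_context

def proxy_book(url, token, chunk_size=1024 * 16):
    # Meant to be returned from inside a Flask view: the upstream response is
    # forwarded to the patron chunk by chunk instead of being read into memory.
    upstream = requests.get(
        url, headers={"Authorization": "Bearer {}".format(token)}, stream=True
    )
    upstream.raise_for_status()
    return Response(
        stream_with_context(upstream.iter_content(chunk_size)),
        content_type=upstream.headers.get("Content-Type", "application/pdf"),
    )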

@leonardr
Contributor

leonardr commented Dec 2, 2020

Remove the short-circuit and I think we can merge this; the remaining issues have been moved into separate tickets.

@vbessonov
Contributor Author

@leonardr, thank you. I commented out the short-circuit and added a comment mentioning SIMPLY-3343.

@leonardr leonardr merged commit b1f0f94 into NYPL-Simplified:develop Dec 9, 2020