Doab load #363

Merged: 23 commits from doab_load merged into master on Jul 25, 2014
Conversation

@rdhyee (Member) commented Jun 7, 2014

Current status of the doab_load branch:

django-admin.py doab_load_books
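
For illustration, a management command with that name would be wired up roughly like this (a sketch only; load_from_json is a hypothetical helper name, not necessarily what this branch defines):

# A minimal sketch, not the code in this PR; load_from_json is a
# hypothetical entry point standing in for whatever regluit.core.doab exposes.
from django.core.management.base import BaseCommand

from regluit.core import doab

class Command(BaseCommand):
    help = "load books from the DOAB metadata dump (bookdata/doab.json)"

    def handle(self, *args, **options):
        doab.load_from_json('bookdata/doab.json')  # hypothetical helper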

Possible next steps to do (whether before or after a merge):

  • instead of including only a single ISBN per record, include all the ISBNs available in the DOAB records, then add logic to check whether loading the work succeeded and retry if it didn't
  • write some code to do the reloading
  • add a periodic task that looks for updates in DOAB and loads those books (a rough sketch follows the list below)

* assuming that 1 DOAB ID is associated with at most 1 Work or 1 Edition
* explicitly throw an exception if Google Books doesn't recognize the ISBN in question
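
As a rough sketch of the periodic-task idea above (celery is already in the stack; books_updated_since and load_doab_record are hypothetical helper names, not code in this branch):

# Rough sketch of the "periodic task" item above, not code in this PR.
from datetime import timedelta

from celery.task import periodic_task

from regluit.core import doab

@periodic_task(run_every=timedelta(days=1))
def update_doab():
    # books_updated_since and load_doab_record are illustrative names
    for record in doab.books_updated_since(days=1):
        doab.load_doab_record(record)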
@eshellman (Contributor)

Great. I'll take a quick look, but probably will need Monday to absorb.

@rdhyee (Member Author) commented Jun 26, 2014

I've also started a new DOAB repo with the code I wrote to generate the doab.json file for loading: https://github.com/Gluejar/DOAB

# if there is no such work, try to find an Edition with the seed_isbn and use that work to hang off of
except models.Identifier.DoesNotExist:
    sister_edition = add_by_isbn(seed_isbn)
Contributor

this will fail if google and unglue.it don't have the isbn. but we have enough metadata to create a work, don't we? does doab tell you the language?

Member Author

I think we have enough to create a Work and Edition to which we can attach an Ebook. (Do you know off the top of your head what the minimal requirements are -- i.e., which fields are mandatory?) A quick glance at the model definition of Work shows that the language field is a necessity:

language = models.CharField(max_length=5, default="en", null=False)

When I wrote this code (as a first pass), I was assuming that we needed to have Google recognize the ISBN.
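
As a hedged sketch of that fallback (only Work.language is confirmed above; the other field names are assumptions about the models):

# Hedged sketch of the fallback being discussed: if Google doesn't
# recognize the ISBN, create the Work/Edition ourselves. Only
# Work.language is quoted above; the other field names are assumptions.
from regluit.core import models

def create_stub_work(title, language="en"):
    work = models.Work.objects.create(title=title, language=language)
    edition = models.Edition.objects.create(title=title, work=work)
    return work, edition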

About language metadata in DOAB: yes, DOAB does have language metadata, but I don't currently include it in https://github.com/Gluejar/regluit/blob/doab_load/bookdata/doab.json.

There are some problems in the language metadata, however. Of the 1953 records I'm currently working with, 303 have no language information. The language distribution for the rest is as follows:

English 965
German 186
de 140
fr 124
Italian 82
Dutch 40
english 19
en 17
French 16
En 14
italian 9
de^it^rm 7
de^English 5
german 3
Czech 3
Deutsch 2
Italian / English 2
English; French 1
Englilsh ; Cree 1
English; French; Cree; Michif; Chinese; Ukrainian 1
Englisch 1
Spanish 1
English; 1
de^la 1
de^English^fr^es 1
English; Italian 1
Espanol 1
Welsh 1
English; Czech 1
Englilsh 1
German; 1
German; English 1
Russian; 1

Clearly, if we are going to make use of this language metadata, I'll have to do some cleaning and mapping of the metadata into the ISO (?) language codes we're currently using.
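
For illustration, the cleanup could look something like this (the mapping table is my own sketch based on the distribution above, not code in this branch):

# Illustrative sketch: map DOAB's free-form language strings onto the
# two-letter codes regluit stores, with 'xx' for unknown (see below).
LANG_MAP = {
    'english': 'en', 'en': 'en', 'englisch': 'en', 'englilsh': 'en',
    'german': 'de', 'de': 'de', 'deutsch': 'de',
    'french': 'fr', 'fr': 'fr',
    'italian': 'it', 'it': 'it',
    'dutch': 'nl',
    'czech': 'cs',
    'spanish': 'es', 'espanol': 'es',
    'welsh': 'cy',
    'russian': 'ru',
}

def map_doab_language(value):
    """Return an ISO 639-1 code for a raw DOAB language string, or 'xx'."""
    if not value:
        return 'xx'
    # values like 'de^English^fr^es' or 'English; French' list several
    # languages; take the first one
    first = value.replace('^', ';').replace('/', ';').split(';')[0].strip().lower()
    return LANG_MAP.get(first, 'xx')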

Contributor

we're using lang code xx for unknown

@eshellman (Contributor)

yes, we should definitely load all isbns in the doab record. Otherwise we'll have nagging duplicates.

I know we discussed this, but are the urls that we have download urls or access urls? Perhaps we can talk about it in the morning.

@rdhyee (Member Author) commented Jun 27, 2014

In terms of download vs. access URLs: for each of the URLs coming from DOAB, I did an HTTP HEAD request to compute the content-type. The loader only loads records with a content-type of pdf at the moment.

if d['format'] == 'pdf':
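
The HEAD check can be done along these lines (an illustrative sketch using requests, not the actual script that produced d['format']):

# Illustrative version of the HEAD check described above.
import requests

def url_is_pdf(url):
    # follow redirects so we see the content-type of the final resource
    content_type = requests.head(url, allow_redirects=True).headers.get('content-type', '')
    return 'application/pdf' in content_type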

@eshellman (Contributor)

ok, I guess I need to look at that work

@rdhyee (Member Author) commented Jun 30, 2014

Eric, you wrote:

yes, we should definitely load all isbns in the doab record. Otherwise we'll have nagging duplicates.

Can you elaborate? Are you saying that I should create an edition for each ISBN? What kind of problem will we have with "nagging duplicates" if I load only one ISBN per DOAB id?

@eshellman (Contributor)

1. Our current model is one ISBN per edition.

2. Suppose we have an ebook in our database under isbn2, and DOAB has both isbn1 and isbn2. If you just load isbn1, then we get 2 works that will need merging by hand.

@rdhyee (Member Author) commented Jul 3, 2014

I'm going to try the following algorithm:

for each doab_id:
     for each isbn:
          try to find a google_id for the isbn (via add_by_isbn); otherwise create our own edition
          (asynchronously) populate related isbns for each of the isbns
     for each of the works associated with this list of editions:
          manually run merge_work on all of them pairwise

We should then end up with one work per doab_id --> tie that doab_id to the super-merged work.
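
A rough Python rendering of the above (add_by_isbn is the loader already used in this PR; new_edition_for_isbn and the async step are placeholders, imports omitted):

# Sketch only; the pairwise merge stays a manual step for now.
def load_doab_id(doab_id, isbns):
    editions = []
    for isbn in isbns:
        edition = add_by_isbn(isbn)                # ask Google for this ISBN
        if edition is None:
            edition = new_edition_for_isbn(isbn)   # hypothetical local fallback
        editions.append(edition)
        # kick off asynchronous population of related ISBNs here

    works = set(edition.work for edition in editions)
    # merge_work is then run by hand, pairwise, until one work remains;
    # the doab_id identifier gets attached to that surviving work
    return works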

@eshellman (Contributor)

so, doab_id is a work identifier, not an edition identifier, right? if so, the edition shouldn't be in the identifier record. (Have not checked that this is what the doab load code does.)

@rdhyee (Member Author) commented Jul 3, 2014

I've been operating under the misconception that for any Identifier, not only must a Work be attached but also an Edition. I see that we have plenty of Identifiers with null for edition, but every Identifier is tied to some Work.

I will adjust my code accordingly... because, yes, I think of doab_id as a work-level identifier. (Note, however, that the algorithm I'm trying out is force-merging the works that emerge from our thingisbn + Google language clustering.)

…args to load_doab_edition

functions in regluit.core.doab and regluit.core.tasks.
…m notebook.

Code in doab_loading.ipynb for testing the loading
@eshellman (Contributor)

Is this ready for re-review?

@rdhyee (Member Author) commented Jul 7, 2014

Not quite -- though the new function looks like it works well. I'm writing some tests. Hopefully the PR will be ready for re-review later today.

@rdhyee (Member Author) commented Jul 7, 2014

I could possibly load subject, creator, publication date, and publisher metadata too -- metadata I didn't write out to the doab.json file yet. I'm checking to make sure all this metadata is useful -- in particular the subject metadata. We don't make use of subject metadata at the moment, right?

@eshellman (Contributor)

subject metadata -- no, but we need to, soon!

@rdhyee (Member Author) commented Jul 8, 2014

I was surprised to find invalid ISBNs in the DOAB records -- 12 of the 900 or so ISBNs for records w/ known licenses for pdfs are invalid.

[on second thought]: let me be more precise... the ISBNs in these records do not have the proper checksums. Is it possible for the actual ISBNs of books not to have proper checksums? Or does the presence of numbers that don't have good checksums mean that the ISBN metadata must be wrong?
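
(For reference, an ISBN's check digit is computed from its other digits, so a correctly assigned ISBN always validates; a number that fails the checksum means the recorded metadata itself is wrong. The ISBN-13 test is simple:)

# Standard ISBN-13 check-digit test, not code from this PR: weights
# alternate 1,3,1,3,... and the weighted sum must be divisible by 10.
def isbn13_is_valid(isbn):
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits)) % 10 == 0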

@eshellman (Contributor)

That error rate is less than what I'd expect for a non-checked corpus. Not worth trying to fix them. Just discard them.

…aching IDs to it.

code in doab_load.ipynb to load books and test the integrity of the load.  Big surprise (to me):  to find invalid ISBNs in the DOAB data
@rdhyee (Member Author) commented Jul 8, 2014

For books that have Google Books IDs, I'm just using whatever covers Google Books serves up. For those DOAB records that don't yield a Google Books ID, I'm going to set work.preferred_edition.cover_image to the DOAB cover URL -- which is of the form http://www.doabooks.org/doab?func=cover&rid=12592.

Alternatively, I could set the covers for all DOAB works to the DOAB covers.

What do you think?

@eshellman (Contributor)

I would retrieve the doab cover and re-serve it.
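
Something along these lines would do it (a sketch only; the storage location and file naming are assumptions about regluit's setup):

# Hedged sketch of "retrieve the cover and re-serve it": fetch the DOAB
# cover once, keep our own copy, and point cover_image at that copy.
import requests
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

def mirror_doab_cover(edition, doab_id):
    url = 'http://www.doabooks.org/doab?func=cover&rid=%s' % doab_id
    response = requests.get(url)
    if response.status_code == 200:
        path = default_storage.save('covers/doab/%s.jpg' % doab_id,
                                    ContentFile(response.content))
        edition.cover_image = default_storage.url(path)
        edition.save()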

@rdhyee (Member Author) commented Jul 24, 2014

Eric: I think it's worth your looking at this PR again.

edition.publication_date = publication_date
edition_to_save = True

# TO DO: insert publisher name properly
Contributor

use edition.set_publisher(publisher_name)

@eshellman (Contributor)

one easy addition, everything else looks good

@rdhyee (Member Author) commented Jul 25, 2014

OK -- I think we're ready to merge this into master. There is more work to do, but it'd be good to load this set of books.

rdhyee added a commit that referenced this pull request Jul 25, 2014
@rdhyee rdhyee merged commit 550abf2 into master Jul 25, 2014
@eshellman (Contributor)

agreed
