Doab load #363

Merged: 23 commits from doab_load merged into master on Jul 25, 2014
Conversation

@rdhyee (Member) commented Jun 7, 2014

Current status of the doab_load branch:

django-admin.py doab_load_books
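
For illustration, a management command with that name would be wired up roughly like this (a sketch only; load_from_json is a hypothetical helper name, not necessarily what this branch defines):

# A minimal sketch, not the code in this PR; load_from_json is a
# hypothetical entry point standing in for whatever regluit.core.doab exposes.
from django.core.management.base import BaseCommand

from regluit.core import doab

class Command(BaseCommand):
    help = "load books from the DOAB metadata dump (bookdata/doab.json)"

    def handle(self, *args, **options):
        doab.load_from_json('bookdata/doab.json')  # hypothetical helper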

Possible next steps to do (whether before or after a merge):

  • instead of including only a single ISBN per record, include all the ISBNs available in the DOAB records, then add logic to check whether loading the work succeeded and retry if it didn't
  • write some code to do the reloading
  • add a periodic task that looks for updates in DOAB and loads those books (a rough sketch follows the list below)

* assuming that 1 DOAB ID is associated with at most 1 Work or 1 Edition
* explicitly throw an exception if Google Books doesn't recognize the ISBN in question
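
As a rough sketch of the periodic-task idea above (celery is already in the stack; books_updated_since and load_doab_record are hypothetical helper names, not code in this branch):

# Rough sketch of the "periodic task" item above, not code in this PR.
from datetime import timedelta

from celery.task import periodic_task

from regluit.core import doab

@periodic_task(run_every=timedelta(days=1))
def update_doab():
    # books_updated_since and load_doab_record are illustrative names
    for record in doab.books_updated_since(days=1):
        doab.load_doab_record(record)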
@eshellman (Contributor)

Great. I'll take a quick look, but probably will need Monday to absorb.

@rdhyee (Member Author) commented Jun 26, 2014

I've also started a new DOAB repo with the code I wrote to generate the doab.json file for loading: https://github.com/Gluejar/DOAB

# if there is no such work, try to find an Edition with the seed_isbn and use that work to hang off of
except models.Identifier.DoesNotExist:
    sister_edition = add_by_isbn(seed_isbn)
Contributor

this will fail if google and unglue.it don't have the isbn. but we have enough metadata to create a work, don't we? does doab tell you the language?

Member Author

I think we have enough to create a Work and Edition to which we can attach an Ebook. (Do you know off the top of your head what the minimal requirements are -- i.e., which fields are mandatory?) A quick glance at the model definition of Work shows that the language field is a necessity:

language = models.CharField(max_length=5, default="en", null=False)

When I wrote this code (as a first pass), I was assuming that we needed to have Google recognize the ISBN.
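
As a hedged sketch of that fallback (only Work.language is confirmed above; the other field names are assumptions about the models):

# Hedged sketch of the fallback being discussed: if Google doesn't
# recognize the ISBN, create the Work/Edition ourselves. Only
# Work.language is quoted above; the other field names are assumptions.
from regluit.core import models

def create_stub_work(title, language="en"):
    work = models.Work.objects.create(title=title, language=language)
    edition = models.Edition.objects.create(title=title, work=work)
    return work, edition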

About language metadata in DOAB: yes, DOAB does have language metadata, but I don't currently include it in https://github.com/Gluejar/regluit/blob/doab_load/bookdata/doab.json.

There are some problems in the language metadata, however. Of the 1953 records I'm currently working with, 303 have no language information. The language distribution for the rest is as follows:

English 965
German 186
de 140
fr 124
Italian 82
Dutch 40
english 19
en 17
French 16
En 14
italian 9
de^it^rm 7
de^English 5
german 3
Czech 3
Deutsch 2
Italian / English 2
English; French 1
Englilsh ; Cree 1
English; French; Cree; Michif; Chinese; Ukrainian 1
Englisch 1
Spanish 1
English; 1
de^la 1
de^English^fr^es 1
English; Italian 1
Espanol 1
Welsh 1
English; Czech 1
Englilsh 1
German; 1
German; English 1
Russian; 1

Clearly, if we are going to make use of this language metadata, I'll have to do some cleaning and mapping of the metadata into the ISO (?) language codes we're currently using.
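
For illustration, the cleanup could look something like this (the mapping table is my own sketch based on the distribution above, not code in this branch):

# Illustrative sketch: map DOAB's free-form language strings onto the
# two-letter codes regluit stores, with 'xx' for unknown (see below).
LANG_MAP = {
    'english': 'en', 'en': 'en', 'englisch': 'en', 'englilsh': 'en',
    'german': 'de', 'de': 'de', 'deutsch': 'de',
    'french': 'fr', 'fr': 'fr',
    'italian': 'it', 'it': 'it',
    'dutch': 'nl',
    'czech': 'cs',
    'spanish': 'es', 'espanol': 'es',
    'welsh': 'cy',
    'russian': 'ru',
}

def map_doab_language(value):
    """Return an ISO 639-1 code for a raw DOAB language string, or 'xx'."""
    if not value:
        return 'xx'
    # values like 'de^English^fr^es' or 'English; French' list several
    # languages; take the first one
    first = value.replace('^', ';').replace('/', ';').split(';')[0].strip().lower()
    return LANG_MAP.get(first, 'xx')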

Contributor

we're using lang code xx for unknown

@eshellman (Contributor)

yes, we should definitely load all isbns in the doab record. Otherwise we'll have nagging duplicates.

I know we discussed this, but are the urls that we have download urls or access urls? Perhaps we can talk about it in the morning.

@rdhyee (Member Author) commented Jun 27, 2014

In terms of download vs. access URLs: for each of the URLs coming from DOAB, I did an HTTP HEAD request to compute the content-type. The loader only loads records with a content-type of pdf at the moment.

if d['format'] == 'pdf':
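
The HEAD check can be done along these lines (an illustrative sketch using requests, not the actual script that produced d['format']):

# Illustrative version of the HEAD check described above.
import requests

def url_is_pdf(url):
    # follow redirects so we see the content-type of the final resource
    content_type = requests.head(url, allow_redirects=True).headers.get('content-type', '')
    return 'application/pdf' in content_type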

@eshellman (Contributor)

ok, I guess I need to look at that work

@rdhyee (Member Author) commented Jun 30, 2014

Eric, you wrote:

yes, we should definitely load all isbns in the doab record. Otherwise we'll have nagging duplicates.

Can you elaborate? Are you saying that I should create an edition for each ISBN? What kind of problem will we have with "nagging duplicates" if I load only one ISBN per DOAB id?

@eshellman (Contributor)

1. Our current model is one ISBN per edition.

2. Suppose we have an ebook in our database under isbn2, and DOAB has both isbn1 and isbn2. If you just load isbn1, then we get 2 works that will need merging by hand.

@rdhyee (Member Author) commented Jul 3, 2014

I'm going to try the following algorithm:

for each doab_id:
     for each isbn:
          try to find a google_id for the isbn (via add_by_isbn); otherwise create our own edition
          (asynchronously) populate related isbns for each of the isbns
     for each of the works associated with this list of editions:
          manually run merge_work on all of them pairwise

We should then end up with one work per doab_id --> tie that doab_id to the super-merged work.
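
A rough Python rendering of the above (add_by_isbn is the loader already used in this PR; new_edition_for_isbn and the async step are placeholders, imports omitted):

# Sketch only; the pairwise merge stays a manual step for now.
def load_doab_id(doab_id, isbns):
    editions = []
    for isbn in isbns:
        edition = add_by_isbn(isbn)                # ask Google for this ISBN
        if edition is None:
            edition = new_edition_for_isbn(isbn)   # hypothetical local fallback
        editions.append(edition)
        # kick off asynchronous population of related ISBNs here

    works = set(edition.work for edition in editions)
    # merge_work is then run by hand, pairwise, until one work remains;
    # the doab_id identifier gets attached to that surviving work
    return works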

@eshellman (Contributor)

so, doab_id is a work identifier, not an edition identifier, right? if so, the edition shouldn't be in the identifier record. (Have not checked that this is what the doab load code does.)

@rdhyee (Member Author) commented Jul 3, 2014

I've been operating under the misconception that for any Identifier, not only must a Work be attached but also an Edition. I see that we have plenty of Identifiers with null for edition, but every Identifier is tied to some Work.

I will adjust my code accordingly... because, yes, I think of doab_id as a work-level identifier. (Note, however, that the algorithm I'm trying out is force-merging the works that emerge from our thingisbn + Google language clustering.)

…args to load_doab_edition

functions in regluit.core.doab and regluit.core.tasks.
…m notebook.

Code in doab_loading.ipynb for testing the loading
@eshellman (Contributor)

Is this ready for re-review?

@rdhyee (Member Author) commented Jul 7, 2014

Not quite -- though the new function looks like it works well. I'm writing some tests. Hopefully the PR will be ready for re-review later today.

@rdhyee (Member Author) commented Jul 7, 2014

I could possibly load subject, creator, publication date, and publisher metadata too -- metadata I didn't write out to the doab.json file yet. I'm checking to make sure all this metadata is useful -- in particular the subject metadata. We don't make use of subject metadata at the moment, right?

@eshellman (Contributor)

subject metadata -- no, but we need to, soon!

@rdhyee (Member Author) commented Jul 8, 2014

I was surprised to find invalid ISBNs in the DOAB records -- 12 of the 900 or so ISBNs for records w/ known licenses for pdfs are invalid.

[on second thought]: let me be more precise... the ISBNs in these records do not have the proper checksums. Is it possible for the actual ISBNs of books not to have proper checksums? Or does the presence of numbers that don't have good checksums mean that the ISBN metadata must be wrong?
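
(For reference, an ISBN's check digit is computed from its other digits, so a correctly assigned ISBN always validates; a number that fails the checksum means the recorded metadata itself is wrong. The ISBN-13 test is simple:)

# Standard ISBN-13 check-digit test, not code from this PR: weights
# alternate 1,3,1,3,... and the weighted sum must be divisible by 10.
def isbn13_is_valid(isbn):
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits)) % 10 == 0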

@eshellman (Contributor)

That error rate is less than what I'd expect for a non-checked corpus. Not worth trying to fix them. Just discard them.

…aching IDs to it.

code in doab_load.ipynb to load books and test the integrity of the load.  Big surprise (to me):  to find invalid ISBNs in the DOAB data
@rdhyee (Member Author) commented Jul 8, 2014

For books that have Google Books IDs, I'm just using whatever covers Google Books serves up. For those DOAB records that don't yield a Google Books ID, I'm going to set work.preferred_edition.cover_image to the DOAB cover URL -- which is of the form http://www.doabooks.org/doab?func=cover&rid=12592.

Alternatively, I could set the covers for all DOAB works to the DOAB covers.

What do you think?

@eshellman (Contributor)

I would retrieve the doab cover and re-serve it.
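
Something along these lines would do it (a sketch only; the storage location and file naming are assumptions about regluit's setup):

# Hedged sketch of "retrieve the cover and re-serve it": fetch the DOAB
# cover once, keep our own copy, and point cover_image at that copy.
import requests
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

def mirror_doab_cover(edition, doab_id):
    url = 'http://www.doabooks.org/doab?func=cover&rid=%s' % doab_id
    response = requests.get(url)
    if response.status_code == 200:
        path = default_storage.save('covers/doab/%s.jpg' % doab_id,
                                    ContentFile(response.content))
        edition.cover_image = default_storage.url(path)
        edition.save()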

@rdhyee (Member Author) commented Jul 24, 2014

Eric: I think it's worth your looking at this PR again.

edition.publication_date = publication_date
edition_to_save = True

# TO DO: insert publisher name properly
Contributor

use edition.set_publisher(publisher_name)

@eshellman (Contributor)

one easy addition, everything else looks good

@rdhyee (Member Author) commented Jul 25, 2014

OK -- I think we're ready to merge this into master. There is more work to do, but it'd be good to load this set of books.

rdhyee added a commit that referenced this pull request Jul 25, 2014
@rdhyee rdhyee merged commit 550abf2 into master Jul 25, 2014
@eshellman (Contributor)

agreed
