Doab load #363
Conversation
… the DOAB records into json format yet.
* assuming that 1 DOAB ID is associated with at most 1 Work or 1 Edition
* explicitly throw an exception if Google Books doesn't recognize the ISBN in question
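The second bullet might look something like the following sketch. The exception class and helper name are hypothetical, not regluit's actual code; the response shape assumed is that of the Google Books volumes API, which reports `totalItems` and an `items` list.

```python
class ISBNNotRecognized(Exception):
    """Raised when a Google Books lookup returns no volumes for an ISBN.

    Hypothetical exception name; regluit's actual lookup code may
    signal this differently.
    """

def first_google_books_volume(isbn, response):
    """Return the first volume from a parsed Google Books API response.

    `response` is assumed to be the parsed JSON of a volumes query,
    which reports `totalItems` and an `items` list.
    """
    if not response.get("totalItems") or not response.get("items"):
        raise ISBNNotRecognized(isbn)
    return response["items"][0]
```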
Great. I'll take a quick look, but probably will need Monday to absorb.
I've also started a new DOAB repo with the code I wrote to generate the doab.json file for loading: https://github.com/Gluejar/DOAB
```python
# if there is no such work, try to find an Edition with the seed_isbn and use that work to hang off of
except models.Identifier.DoesNotExist:
    sister_edition = add_by_isbn(seed_isbn)
```
this will fail if google and unglue.it don't have the isbn. but we have enough metadata to create a work, don't we? does doab tell you the language?
I think we have enough to create a Work and Edition to which we can attach an Ebook. (Do you know off the top of your head what the minimal requirements are -- i.e., which fields are mandatory?) A quick glance at the model definition of Work shows that the language field is a necessity:
Line 987 in 03f3dd1:

```python
language = models.CharField(max_length=5, default="en", null=False)
```
About language metadata in DOAB. Yes, DOAB does have language metadata, but I don't currently have it included in https://github.com/Gluejar/regluit/blob/doab_load/bookdata/doab.json.
There are some problems in the language metadata, however. Of the 1953 records I'm currently working with, 303 have no language information. The remaining language distribution is as follows:
English 965
German 186
de 140
fr 124
Italian 82
Dutch 40
english 19
en 17
French 16
En 14
italian 9
de^it^rm 7
de^English 5
german 3
Czech 3
Deutsch 2
Italian / English 2
English; French 1
Englilsh ; Cree 1
English; French; Cree; Michif; Chinese; Ukrainian 1
Englisch 1
Spanish 1
English; 1
de^la 1
de^English^fr^es 1
English; Italian 1
Espanol 1
Welsh 1
English; Czech 1
Englilsh 1
German; 1
German; English 1
Russian; 1
Clearly, if we are going to make use of this language metadata, I'll have to do some cleaning and mapping of the metadata into the ISO (?) language codes we're currently using.
we're using lang code xx for unknown
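Given the distribution above, the cleanup could be a small mapping pass. A minimal sketch, with "xx" as the fallback for unknown values; the mapping table and function name are illustrative, not regluit code:

```python
import re

# Normalizations for the raw values seen in the distribution above
# (case-insensitive; multi-language values are split on '^', ';', '/').
LANG_MAP = {
    "english": "en", "en": "en", "englisch": "en", "englilsh": "en",
    "german": "de", "de": "de", "deutsch": "de",
    "french": "fr", "fr": "fr",
    "italian": "it", "it": "it",
    "dutch": "nl",
    "czech": "cs",
    "spanish": "es", "espanol": "es", "es": "es",
    "welsh": "cy",
    "russian": "ru",
    "la": "la", "rm": "rm",
}

def normalize_doab_language(raw):
    """Reduce a raw DOAB language field to a single two-letter code.

    Multi-language values are reduced to the first recognizable
    language; empty or unrecognized values map to "xx" (unknown).
    """
    if not raw:
        return "xx"
    for part in re.split(r"[\^;/]", raw):
        code = LANG_MAP.get(part.strip().lower())
        if code:
            return code
    return "xx"
```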
yes, we should definitely load all isbns in the doab record. Otherwise we'll have nagging duplicates. I know we discussed this, but are the urls that we have download urls or access urls? Perhaps we can talk about it in the morning.
In terms of download vs access urls: for each of the URLs coming from DOAB, I did an HTTP HEAD to compute the content-type. The loader only uploads records with a content-type of pdf at the moment. Line 99 in f43d40b
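The content-type check described here can be sketched as follows; this is a stand-alone approximation, not the code at the referenced line:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def content_type_is_pdf(content_type):
    """True if a Content-Type header value indicates a PDF."""
    return "pdf" in (content_type or "").lower()

def url_serves_pdf(url, timeout=10):
    """Issue an HTTP HEAD request and test the reported content-type.

    Sketch of the approach described above; the actual loader logic
    lives in regluit.core.doab and may differ.
    """
    try:
        resp = urlopen(Request(url, method="HEAD"), timeout=timeout)
    except (URLError, ValueError):
        return False
    return content_type_is_pdf(resp.headers.get("Content-Type"))
```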
ok, I guess I need to look at that work
Eric, you wrote:
Can you elaborate? Are you saying that I should create an edition for each ISBN? What kind of problem will we have with "nagging duplicates" if I loaded only one ISBN per DOAB id?
1- our current model is one isbn per edition. suppose we have an ebook in our database under isbn2, and doab has (isbn1 and isbn2). if you just do isbn1, then we get 2 works that will need merging by hand.
I'm going to try the following algorithm:
…nguage of the works, produced by Gluejar/DOAB@57e54e0
…rol before updating it.
so, doab_id is a work identifier, not an edition identifier, right? if so, the edition shouldn't be in the identifier record. (Have not checked that this is what the doab load code does.)
I've been operating under the misconception that for any Identifier, not only must a Work be attached but also an Edition. I see that we have plenty of Identifiers with null for edition, but every Identifier is tied to some Work. I will adjust my code accordingly....because, yes, I think of doab_id as a work-level identifier. (Note, however, that the algorithm I'm trying out is force-merging the works that emerge from our thingisbn+Google language clustering.)
…args to load_doab_edition functions in regluit.core.doab and regluit.core.tasks.
…s --> at least in ipynb form
…m notebook. Code in doab_loading.ipynb for testing the loading
Is this ready for re-review?
Not quite -- though the new function looks like it works well. I'm writing some tests. Hopefully the PR will be ready for re-review later today.
I could possibly load subject, creator, publication date, and publisher metadata too -- metadata I didn't write out to the doab.json file yet. I'm checking to make sure all this metadata is useful -- in particular the subject metadata. We don't make use of subject metadata at the moment, right?
subject metadata -- no, but we need to, soon!
I was surprised to find invalid ISBNs in the DOAB records -- 12 of the 900 or so ISBNs for records w/ known licenses for pdfs are invalid. [on second thought]: let me be more precise....the ISBNs in the records do not have the proper checksums. Is it possible that the actual ISBNs of books do not have proper checksums? Or does the presence of numbers that don't have good checksums mean that the ISBN metadata must be wrong?
That error rate is less than what I'd expect for a non-checked corpus. Not worth trying to fix them. Just discard them.
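For reference, the checksum test that flags ISBNs like those 12 works as follows. This is a generic implementation of the standard ISBN-10/ISBN-13 check-digit rules, not regluit's own ISBN utilities:

```python
def isbn_checksum_ok(isbn):
    """Validate the check digit of an ISBN-10 or ISBN-13 string.

    Hyphens and spaces are ignored. ISBN-10 uses a weighted sum mod 11
    (with 'X' standing for 10 in the last position); ISBN-13 uses
    alternating weights 1 and 3 with a sum that must be 0 mod 10.
    """
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        if not (digits[:9].isdigit() and (digits[9].isdigit() or digits[9] == "X")):
            return False
        total = sum((10 - i) * int(d) for i, d in enumerate(digits[:9]))
        total += 10 if digits[9] == "X" else int(digits[9])
        return total % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
        return total % 10 == 0
    return False
```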
…aching IDs to it. code in doab_load.ipynb to load books and test the integrity of the load. Big surprise (to me): to find invalid ISBNs in the DOAB data
For books that have Google Books IDs, I'm just using whatever covers Google Books serves up. For those doab records that don't yield a Google Books ID, I'm going to set work.preferred_edition.cover_image to the DOAB cover URL -- which is of the form http://www.doabooks.org/doab?func=cover&rid=12592. I could possibly set all the DOAB work covers to the DOAB covers. What do you think?
I would retrieve the doab cover and re-serve it.
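Re-serving could start from something like this sketch: fetch the cover bytes from the DOAB cover URL pattern quoted above, then store them ourselves (the later commits mention S3) instead of hotlinking doabooks.org. The function and constant names are illustrative only:

```python
from urllib.request import urlopen

# URL pattern quoted in the discussion above
DOAB_COVER_URL = "http://www.doabooks.org/doab?func=cover&rid=%s"

def fetch_doab_cover(doab_id, timeout=10):
    """Fetch the cover image for a DOAB record id.

    Returns (content_type, image_bytes). A real loader would then
    upload the bytes to our own storage and point cover_image at that
    stored copy rather than at doabooks.org.
    """
    with urlopen(DOAB_COVER_URL % doab_id, timeout=timeout) as resp:
        return resp.headers.get("Content-Type", ""), resp.read()
```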
…f the cover is not already there.)
… our S3 space and integrating that code into uploading the DOAB records
…he signatures of some methods
…escription, edition.publication_date, work.subjects, and edition.publisher_name
Eric: I think it's worth your looking at this PR again.
```python
edition.publication_date = publication_date
edition_to_save = True

# TO DO: insert publisher name properly
```
use `edition.set_publisher(publisher_name)`
one easy addition, everything else looks good
OK -- I think we're ready to merge this into master. There is more work to do, but it'd be good to load this set of books.
agreed
Current status of the doab_load branch:
Possible next steps (whether before or after a merge):