
Add abstracts entities into the KG #39

Closed
ceteri opened this issue Jan 17, 2020 · 4 comments · Fixed by #58
Assignees
Labels
enhancement New feature or request

Comments

@ceteri
Contributor

ceteri commented Jan 17, 2020

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template; the main part to reuse is how it iterates through partitions in BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.
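To make the later TextRank stage concrete, here is a simplified, pure-Python sketch of TextRank-style key-phrase extraction over a word co-occurrence graph. This is hypothetical: the actual stage would more likely use a library such as PyTextRank, and the function name, window size, and filtering here are assumptions for illustration only.

```python
from collections import defaultdict

def textrank_keywords(text, window=3, top_n=5, damping=0.85, iters=30):
    """Rank words by PageRank over a co-occurrence graph; return the top terms."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if len(w) > 3]  # crude short-word/stopword filter

    # build an undirected co-occurrence graph over a sliding window
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])

    # iterate the PageRank update a fixed number of times
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }

    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```

This sketch only captures the graph-ranking idea; a real pipeline would add part-of-speech filtering and phrase chunking on top of the word ranking.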

@ceteri ceteri added the enhancement New feature or request label Jan 17, 2020
@ceteri ceteri self-assigned this Jan 17, 2020
@ceteri
Contributor Author

ceteri commented Jan 25, 2020

@ceteri ceteri changed the title Link other required entities into the KG: abstracts Add abstracts entities into the KG Feb 19, 2020
@ceteri ceteri removed their assignment Feb 19, 2020
@JasonZhangzy1757 JasonZhangzy1757 self-assigned this Feb 19, 2020
@JasonZhangzy1757
Collaborator

Hi, @ceteri

I think I now understand how everything works in the template code, especially how it iterates through partitions in BUCKET_STAGE. But I'm a little confused about the input and output. Where is the metadata I should pull the abstract from? Is it BUCKET_STAGE? If so, where should I write the result (is it BUCKET_FINAL?). In other words, how can I add anything into the KG? Could you explain a little more about it?

@ceteri
Contributor Author

ceteri commented Feb 25, 2020

Great. For the overall structure, copying code similar to this from run_stage3.py would be an appropriate starting point:

    # initialize the federated API access
    schol = rc_scholapi.ScholInfraAPI(config_file="rc.cfg", logger=None)
    graph = rc_graph.RCGraph("abstract")

    # for each publication: enrich metadata, gather the DOIs, etc.
    for partition, pub_iter in graph.iter_publications(graph.BUCKET_STAGE, filter=args.partition):
        pub_list = []

        for pub in tqdm(pub_iter, ascii=True, desc=partition[:30]):
            pub_list.append(pub)
            match = lookup_abstract(schol, graph, partition, pub)

            if match:
                graph.publications.abs_hits += 1
            else:
                graph.misses.append(pub["title"])

        graph.write_partition(graph.BUCKET_STAGE, partition, pub_list)

    # report errors
    status = "{} successful abstract lookups".format(graph.publications.abs_hits)
    graph.report_misses(status, "publications that failed every abstract lookup")

... then with the added definition for lookup_abstract()
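One possible shape for lookup_abstract() is sketched below. This is hypothetical, not the final implementation: the signature is simplified to take a list of source-key names rather than the schol and graph objects, and the True/False return matches how the loop above counts hits and misses.

```python
def lookup_abstract(sources, pub):
    """Promote an abstract into the top-level pub metadata, if any source has one.

    sources: list of metadata keys to check, e.g. ["Semantic Scholar"]
    pub: one publication record (a dict) from the partition
    Returns True on a hit, False if no source provided an abstract.
    """
    for source in sources:
        meta = pub.get(source)

        if meta and "abstract" in meta:
            # copy the abstract up to the top level, next to title, doi, etc.
            pub["abstract"] = meta["abstract"]
            return True

    return False
```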


Using the 20200204_tdc_newyork_publications.json partition as an example data source, there's a fragment of metadata from Semantic Scholar which has an abstract:

    "Semantic Scholar": {
      "abstract": "Opportunity NYC – Work Rewards is testing three ways of increasing work among families receiving housing vouchers — services and a savings plan under the federal Family Self-Sufficiency (FSS) program, the FSS program plus cash incentives for sustained full-time work, and the cash incentives alone. Early results suggest intriguing positive findings for certain subgroups."
    }

Another function could serve as an accessor, something with code like:

source = schol.semantic

if (source in pub) and ("abstract" in pub[source]):
    meta = pub[source]
    abstract = meta["abstract"]
    pub["abstract"] = abstract

That moves the abstract into the top-level metadata for that publication, alongside title, doi, datasets, authors, etc. Then the loop shown above handles writing the same partition back to the bucket:

graph.write_partition(graph.BUCKET_STAGE, partition, pub_list)

A larger question is whether Semantic Scholar is the only source of abstracts. It may be that we need to expand the RCApi list of APIs and what they return -- for example PubMed, OpenAIRE, CORE, etc. -- and @lobodemonte can help there.

@JasonZhangzy1757
Collaborator

Hi @ceteri

Thanks so much! Your instructions are very detailed and helpful. I have created PR #58 with the code; hopefully it captures what you meant. I now have a much clearer idea of how to add something to the KG, and I'm more familiar with the graph and schol libraries. Please let me know if anything should change, or if there's anything else I can help with :)

@ceteri ceteri closed this as completed Feb 25, 2020
@ceteri ceteri linked a pull request Feb 25, 2020 that will close this issue