
Add abstracts entities into the KG #39

Closed
ceteri opened this issue Jan 17, 2020 · 4 comments · Fixed by #58
Assignees
Labels
enhancement New feature or request

Comments

@ceteri
Contributor

ceteri commented Jan 17, 2020

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template; the main part to reuse is how it iterates through partitions in BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.
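To make the later TextRank stage concrete, here is a simplified, pure-Python sketch of TextRank-style key-phrase extraction over a word co-occurrence graph. This is hypothetical: the actual stage would more likely use a library such as PyTextRank, and the function name, window size, and filtering here are assumptions for illustration only.

```python
from collections import defaultdict

def textrank_keywords(text, window=3, top_n=5, damping=0.85, iters=30):
    """Rank words by PageRank over a co-occurrence graph; return the top terms."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if len(w) > 3]  # crude short-word/stopword filter

    # build an undirected co-occurrence graph over a sliding window
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])

    # iterate the PageRank update a fixed number of times
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(scores[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }

    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```

This sketch only captures the graph-ranking idea; a real pipeline would add part-of-speech filtering and phrase chunking on top of the word ranking.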

@ceteri ceteri added the enhancement New feature or request label Jan 17, 2020
@ceteri ceteri self-assigned this Jan 17, 2020
@ceteri
Contributor Author

ceteri commented Jan 25, 2020

@ceteri ceteri changed the title Link other required entities into the KG: abstracts Add abstracts entities into the KG Feb 19, 2020
@ceteri ceteri removed their assignment Feb 19, 2020
@JasonZhangzy1757 JasonZhangzy1757 self-assigned this Feb 19, 2020
@JasonZhangzy1757
Collaborator

Hi, @ceteri

I think I now understand how everything works in the template code, especially how it iterates through partitions in BUCKET_STAGE. But I'm a little confused about the input and output. Where is the metadata I should pull the abstract from? Is it BUCKET_STAGE? If so, where should I write the result (is it BUCKET_FINAL?). In other words, how can I add anything into the KG? Could you explain a little more about it?

@ceteri
Contributor Author

ceteri commented Feb 25, 2020

Great. For the overall structure, copying code similar to this from run_stage3.py would be an appropriate starting point:

    # initialize the federated API access
    schol = rc_scholapi.ScholInfraAPI(config_file="rc.cfg", logger=None)
    graph = rc_graph.RCGraph("abstract")

    # for each publication: enrich metadata, gather the DOIs, etc.
    for partition, pub_iter in graph.iter_publications(graph.BUCKET_STAGE, filter=args.partition):
        pub_list = []

        for pub in tqdm(pub_iter, ascii=True, desc=partition[:30]):
            pub_list.append(pub)
            match = lookup_abstract(schol, graph, partition, pub)

            if match:
                graph.publications.abs_hits += 1
            else:
                graph.misses.append(pub["title"])

        graph.write_partition(graph.BUCKET_STAGE, partition, pub_list)

    # report errors
    status = "{} successful abstract lookups".format(graph.publications.abs_hits)
    graph.report_misses(status, "publications that failed every abstract lookup")

... then with the added definition for lookup_abstract()
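One possible shape for lookup_abstract() is sketched below. This is hypothetical, not the final implementation: the signature is simplified to take a list of source-key names rather than the schol and graph objects, and the True/False return matches how the loop above counts hits and misses.

```python
def lookup_abstract(sources, pub):
    """Promote an abstract into the top-level pub metadata, if any source has one.

    sources: list of metadata keys to check, e.g. ["Semantic Scholar"]
    pub: one publication record (a dict) from the partition
    Returns True on a hit, False if no source provided an abstract.
    """
    for source in sources:
        meta = pub.get(source)

        if meta and "abstract" in meta:
            # copy the abstract up to the top level, next to title, doi, etc.
            pub["abstract"] = meta["abstract"]
            return True

    return False
```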


Using the 20200204_tdc_newyork_publications.json partition as an example data source, there's a fragment of metadata from Semantic Scholar which has an abstract:

    "Semantic Scholar": {
      "abstract": "Opportunity NYC – Work Rewards is testing three ways of increasing work among families receiving housing vouchers — services and a savings plan under the federal Family Self-Sufficiency (FSS) program, the FSS program plus cash incentives for sustained full-time work, and the cash incentives alone. Early results suggest intriguing positive findings for certain subgroups."
    }

Another function could serve as an accessor, something with code like:

source = schol.semantic

if (source in pub) and ("abstract" in pub[source]):
    meta = pub[source]
    abstract = meta["abstract"]
    pub["abstract"] = abstract

That moves the abstract into the top-level metadata for that publication, alongside title, doi, datasets, authors, etc. Then the loop shown above handles writing the same partition back to the bucket:

graph.write_partition(graph.BUCKET_STAGE, partition, pub_list)

A larger question is whether Semantic Scholar is the only source of abstracts. It may be that we need to expand the RCApi list of APIs and what they return -- for example PubMed, OpenAIRE, CORE, etc. -- and @lobodemonte can help there.

@JasonZhangzy1757
Collaborator

Hi @ceteri

Thanks so much! Your instructions are very detailed and helpful. I have created PR #58 with the code; hopefully it captures what you meant. I now have a much clearer idea of how to add something to the KG, and I'm more familiar with the graph and schol libraries. Please let me know if anything should change, or if there's anything else I can help with :)

@ceteri ceteri closed this as completed Feb 25, 2020
@ceteri ceteri linked a pull request Feb 25, 2020 that will close this issue