Merged nodes.tsv has 3 lines with too many tabs #201

Closed
justaddcoffee opened this issue Jun 3, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@justaddcoffee
Collaborator

(venv) ~/PycharmProjects/kg-emerging-viruses/data/merged *add_pubmed_info $ awk '{print gsub(/\t/,"")}' merged-kg_nodes.tsv | sort | uniq -c
466916 11
   1 175430
   1 85874
   1 88892
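
The awk pipeline above reports how many lines have each tab count, but not which lines are the outliers. A minimal Python sketch of the same check that also returns line numbers is below. The function name `tab_count_report` and the `expected_tabs=11` default are illustrative assumptions, taken from the dominant count in the output above rather than from the project's code.

```python
from collections import Counter


def tab_count_report(lines, expected_tabs=11):
    """Count tabs per line; return (histogram, outliers).

    histogram maps tab count -> number of lines with that count.
    outliers is a list of (line_number, tab_count) for lines whose
    tab count differs from expected_tabs (an assumed value).
    """
    histogram = Counter()
    outliers = []
    for line_number, line in enumerate(lines, start=1):
        n_tabs = line.count("\t")
        histogram[n_tabs] += 1
        if n_tabs != expected_tabs:
            outliers.append((line_number, n_tabs))
    return histogram, outliers


# Small in-memory demo: two well-formed rows and one with extra tabs.
sample = [
    "\t".join(["x"] * 12) + "\n",  # 12 fields -> 11 tabs
    "\t".join(["x"] * 12) + "\n",
    "\t".join(["x"] * 20) + "\n",  # 20 fields -> 19 tabs
]
histogram, outliers = tab_count_report(sample)
print(dict(histogram))
print(outliers)
```

To run it against the real file, pass an open file handle, for example `with open("merged-kg_nodes.tsv", newline="\n") as f: tab_count_report(f)`. Using `newline="\n"` keeps Python's line splitting in step with awk, which breaks records on `\n` only.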
@justaddcoffee added the bug label Jun 3, 2020
@justaddcoffee
Collaborator Author

justaddcoffee commented Jun 3, 2020

This issue seems to arise in the ontology ingest. For example, this line with thousands of tabs is emitted from the GO-plus transform:

biolink:AnatomicalEntity	An atrioventricular valve that is part of the outflow part of the right atrium.	UBERON:0002134	http://purl.obolibrary.org/obo/UBERON_0002134	tricuspid valve	valvula tricuspidalis|valva atrioventricularis dextra|right atrioventricular valve
		http://purl.obolibrary.org/obo/CHEBI_77148			
		http://purl.obolibrary.org/obo/GOCHE_35610			
		http://purl.obolibrary.org/obo/UBERON_0000004 [GOES ON FOR QUITE A WHILE...]

@deepakunni3 could we chat about this when you get a chance? I was going to make a PR in kg-covid-19, but it looks like this TSV is made in KGX, so might need a PR there instead...

@justaddcoffee
Collaborator Author

justaddcoffee commented Jun 4, 2020

Still seeing this after merging #207 to hex-ify tabs in ingested data and after @deepakunni3 updated KGX to deal with internal \"s

$ wget http://kg-hub.berkeleybop.io/kg-covid-19.tar.gz
$ tar -xvzf kg-covid-19.tar.gz
$ awk '{print gsub(/\t/,"")}' merged-kg_nodes.tsv | sort | uniq -c
466916 11
   1 175430
   1 85874
   1 88892
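
For context, "hex-ifying" tabs amounts to replacing literal tab characters inside a field value before the fields are joined into a TSV row, so that a stray tab cannot add a column. The change made in #207 is not shown in this thread, so the following is only a sketch of the general idea: the names `hexify_tabs` and `to_tsv_row` and the `0x09` replacement string are all assumptions, not the project's actual code.

```python
def hexify_tabs(value, replacement="0x09"):
    """Replace literal tabs inside one field value (replacement is an assumption)."""
    return value.replace("\t", replacement)


def to_tsv_row(fields):
    """Join already-sanitized fields with tabs, so tab count == len(fields) - 1."""
    return "\t".join(hexify_tabs(field) for field in fields)


# The middle field contains an embedded tab that would otherwise add a column.
row = to_tsv_row(["UBERON:0002134", "tricuspid\tvalve", "biolink:AnatomicalEntity"])
print(row.count("\t"))
```

A row sanitized this way always has exactly one fewer tab than it has fields. Lines with tens of thousands of tabs therefore point to extra fields or unsanitized values entering somewhere other than the ingested data.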

@deepakunni3
Member

Somehow I am not seeing this.

awk '{print gsub(/\t/,"")}' merged-kg_nodes.tsv | sort | uniq -c
465505 11

@justaddcoffee
Collaborator Author

You're right, Deepak.
