Roundtrip differences #6

Open
stuppie opened this issue Nov 16, 2018 · 2 comments


@stuppie
Collaborator

stuppie commented Nov 16, 2018

I ran a roundtrip, from the files here, and then dumped them out (to nodes_out.csv and edges_out.csv).

Comparing nodes file

  • There are columns "synonyms:IGNORE" and "name" in the input file. I merged "name" with the synonyms, so I cannot separate it back out, so the name column is always blank in the output file, and the synonyms column may contain what used to be in "name".
  • Missing values are blank in the output and "NA" in the input. Is this important?

Looking only at the IDs
$ cut -f1 -d, nodes_out.csv | sort > nodes_out_id.csv
$ cut -f1 -d, ngly1_concepts.csv | sort > ngly1_concepts_sort.csv
$ diff nodes_out_id.csv ngly1_concepts_sort.csv
Result: everything is there except for the 4 items with huge IDs (#2)
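
For anyone who wants to repeat the ID comparison without cut/sort/diff, here is a minimal Python sketch of the same idea. It assumes both files have a header row and that the ID is the first column; the helper names (`read_ids`, `compare_ids`) and the sample IDs are hypothetical, not part of this repo.

```python
import csv
import io


def read_ids(handle):
    """Collect the first-column value of every data row in a CSV stream."""
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first row is a header
    return {row[0] for row in reader if row}


def compare_ids(input_handle, output_handle):
    """Return (IDs only in the input, IDs only in the output), each sorted."""
    input_ids = read_ids(input_handle)
    output_ids = read_ids(output_handle)
    return sorted(input_ids - output_ids), sorted(output_ids - input_ids)


# Hypothetical sample data standing in for ngly1_concepts.csv and nodes_out.csv.
sample_input = io.StringIO("id,name\nEX:1,first\nEX:2,NA\n")
sample_output = io.StringIO("id,name\nEX:1,\n")

only_in_input, only_in_output = compare_ids(sample_input, sample_output)
print(only_in_input)   # IDs lost in the roundtrip
print(only_in_output)  # IDs that appear only after the roundtrip
```

With real files, the same functions can be called on handles opened with `open(path, newline="")`. Unlike `cut -d,`, the csv module also copes with quoted fields that contain commas.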

Comparing edges file

  • I ignored the column "reference_date" in the input file.
  • Nuria's file has some edges where the prop is "None". I ignored those.

$ cut -f1-3 -d, edges_out.csv | sort > edges_out_id.csv
$ cut -f1-3 -d, ngly1_statements.csv | grep -v ",None," | sort > ngly1_statements_id.csv
$ wc -l edges_out_id.csv ngly1_statements_id.csv
786913 edges_out_id.csv
791161 ngly1_statements_id.csv
We're missing 4248 lines...
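
A rough Python sketch of how the gap could be broken down per subject. It assumes the first three columns are subject, property, object (as in the example rows in this issue) and that both files have a header row. It compares whole triples as a multiset rather than diffing the subject column, so the counts may not match the diff output exactly. Function names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def edge_triples(handle, skip_property="None"):
    """Count (subject, property, object) triples, skipping edges whose property is "None"."""
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first row is a header
    return Counter(
        tuple(row[:3])
        for row in reader
        if len(row) >= 3 and row[1] != skip_property
    )


def missing_by_subject(input_triples, output_triples):
    """Per-subject count of triples present in the input but not in the output."""
    missing = input_triples - output_triples  # Counter subtraction keeps positive counts only
    per_subject = Counter()
    for (subj, _prop, _obj), n in missing.items():
        per_subject[subj] += n
    return per_subject


# Hypothetical sample data standing in for ngly1_statements.csv and edges_out.csv.
sample_input = io.StringIO(
    "subj,prop,obj\nA,p,B\nA,p,B\nA,None,C\nD,q,E\n"
)
sample_output = io.StringIO("subj,prop,obj\nA,p,B\n")

per_subject = missing_by_subject(edge_triples(sample_input), edge_triples(sample_output))
for subj, n in sorted(per_subject.items()):
    print(n, subj)
```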

Which subj IDs am I missing?
$ diff -U0 =(cut -f1 -d, edges_out_id.csv) =(cut -f1 -d, ngly1_statements_id.csv) | grep -E "^\+" | uniq -c

      1 +FlyBase:FBgn0000180
      1 +HGNC:17646
      1 +HGNC:633
   2827 +HGNC:6914
   1402 +HGNC:8031
      2 +MGI:102709
      1 +MGI:103201
      1 +RGD:2141
      3 +RGD:2280
      1 +SGD:S000000763
      1 +UniProt:O94778
      1 +UniProt:P29972
      1 +UniProt:P30301
      1 +UniProt:P41181
      1 +UniProt:P55064
      1 +UniProt:P55087
      1 +UniProt:Q13520
      1 +UniProt:Q9UKM7

Missing 2827 from HGNC:6914 and 1402 from HGNC:8031, which we know.
What are the 19 others?

$ diff -U0 =(grep -v HGNC:6914 edges_out_id.csv | grep -v HGNC:8031 | cut -f2 -d,) =(grep -v HGNC:6914 ngly1_statements_id.csv | grep -v HGNC:8031 | cut -f2 -d,) | grep -E "^\+" | uniq -c

  1 +RO:0002200
  2 +RO:0002331
  7 +colocalizes_with
  1 +contributes_to
  8 +rdf:type

The rdf:type issue: #5
I know about colocalizes_with and contributes_to (NuriaQueralt/ngly1-graph#3)

The other two look like weird edge cases. For example:

FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,NA,NA,NA,has phenotype,NA,http://purl.obolibrary.org/obo/RO_0002200
FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,https://www.ncbi.nlm.nih.gov/pubmed/15534205,This edge comes from the Monarch Knowledge Graph 2018.,NA,has phenotype,NA,http://purl.obolibrary.org/obo/RO_0002200

The input file has two lines for the same edge: one has no ref, one does. In wikidata they become a single statement, so on output we end up with one line instead of two. This isn't an issue, as we aren't actually missing anything.
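
To illustrate the collapse, here is a small Python sketch of the effect (not the actual import/export code). It groups rows by (subject, property, object) and keeps the distinct references. Treating "NA" as "no reference" and joining several references with "|" are assumptions made for the sketch, and `merge_edges` is a hypothetical name.

```python
def merge_edges(rows, missing="NA", sep="|"):
    """Collapse rows sharing (subject, property, object) into one row.

    rows: iterable of (subject, property, object, reference_uri) tuples.
    Distinct references are joined with `sep`; "NA" or empty means no reference.
    """
    merged = {}  # insertion-ordered, so output follows first appearance
    for subj, prop, obj, ref in rows:
        refs = merged.setdefault((subj, prop, obj), [])
        if ref and ref != missing and ref not in refs:
            refs.append(ref)
    return [
        (subj, prop, obj, sep.join(refs) if refs else missing)
        for (subj, prop, obj), refs in merged.items()
    ]


# The two input lines quoted above, reduced to their first four columns.
rows = [
    ("FlyBase:FBgn0000180", "RO:0002200", "FBcv:0000435", "NA"),
    ("FlyBase:FBgn0000180", "RO:0002200", "FBcv:0000435",
     "https://www.ncbi.nlm.nih.gov/pubmed/15534205"),
]

collapsed = merge_edges(rows)
for row in collapsed:
    print(",".join(row))
```

Two input lines go in and one comes out, with the reference kept, which is why the line counts differ without any information being lost.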

@stuppie
Collaborator Author

stuppie commented Nov 19, 2018

Re-checked roundtrip as of 9eab7ec
Everything looks good except for the following known issues:

Comparing nodes file

  • There are columns "synonyms:IGNORE" and "name" in the input file. I merged "name" with the synonyms, so I cannot separate it back out, so the name column is always blank in the output file, and the synonyms column may contain what used to be in "name".
  • Missing values are blank in the output and "NA" in the input. Is this important?

Comparing edges file

  • I ignored the column "reference_date" in the input file.
  • Nuria's file has some edges where the prop is "None". Ignore those.
  • If there are two lines for the same edge in the input file, one with no ref and one with a ref, they become one in wikidata. On output, we end up with one line instead of two. This isn't an issue as we aren't actually missing anything, but that line does show up as missing when comparing the roundtrip.
  • colocalizes_with and contributes_to (NuriaQueralt/ngly1-graph#3) are not named, but use the URI.

@NuriaQueralt
Contributor

NuriaQueralt commented Dec 6, 2018

Confirming Krusty::wd_to_neo4j.py

I ran the module and uploaded the CSV (wikibase dump) into the neo4j server.

Data counts

  • All looks good!

    | | CSV-ngly1 (network) | CSV-wikibase (dump) | Neo4j (NGLY1) | Neo4j (Wikibase) | Wikibase |
    |---|---|---|---|---|---|
    | Nodes | 15,786 | 15,782 | 15,786 | 15,782 | 15,782 |
    | Edges | 792,844 | 786,920 | 792,844 | 786,920 | ? |

commands used for nodes files

  • result: 4 missing node instances as expected #6
    Looking only at the IDs
    $ cut -f1 -d, wikibase_concept_out.csv | sort > wikibase_concept_out_sort_id.csv
    $ cut -f1 -d, ngly1_concepts.csv | sort > ngly1_concepts_sort_id.csv
    $ diff wikibase_concept_out_sort_id.csv ngly1_concepts_sort_id.csv
    Result: everything is there except for the 4 items with huge IDs (items with too long of an ID fail #2)

commands used for edges files

  • result: we're missing 4240 lines.
    $ cut -f1-3 -d, wikibase_statements_out.csv | sort > wikibase_statements_out_sort_id.csv
    $ cut -f1-3 -d, ngly1_statements.csv | grep -v ",None," | sort > ngly1_statements_sort_id.csv
    $ wc -l wikibase_statements_out_sort_id.csv ngly1_statements_sort_id.csv
    786921 wikibase_statements_out_sort_id.csv
    791161 ngly1_statements_sort_id.csv

commands used for nodes in neo4j

MATCH (n)
RETURN count(n)

commands used for edges in neo4j

MATCH ()-->() RETURN count(*)

sparql queries for wikibase for nodes

SELECT (COUNT(DISTINCT ?s) AS ?sc)
WHERE{ ?s wdt:P8 ?o } 

sparql queries for wikibase for edges
I don't know how to query this.

Comparing neo4j networks in the browser

  • node attributes: preflabel, name, description and id
  • edge attributes: property_label, property_uri, reference_{uri|date|supporting_text}
  • node and edge attributes don't appear if the value is blank

Comparing nodes file

  • Missing values are blank in the output and "NA" in the input. Is this important?: It is not important. Having blank attributes instead of "NA", which is the standard format for Neo4j, means these attributes don't appear in the browser. My library is going to take care of that.
  • name: some instances have the format "preflabel (ID)", which can be confusing for the user. This is because some items in wikibase have the same label.
  • synonyms:IGNORE = synonyms and name concatenated with "|".
  • There are columns "synonyms:IGNORE" and "name" in the input file. I merged "name" with the synonyms, so I cannot separate it back out, so the name column is always blank in the output file, and the synonyms column may contain what used to be in "name".: name is different in the new CSV (wikibase dump). There, name == preflabel, so it is not blank as stated in the issue.
  • The node_types stats are correct.
MATCH (n)
RETURN DISTINCT labels(n),
count(*) AS NumberOfEntities, reduce(keys = [], keys_n in collect(keys(n)) | keys + filter(k in keys_n WHERE NOT k IN keys)) as EntityAttributes
ORDER BY NumberOfEntities DESC

To change in nodes file @stuppie

  • nothing :)

Comparing edges file

  • Missing attributes are blank instead of "NA".
  • If there are two lines for the same edge in the input file. One has no ref, one does. So in wikidata, they become one. On output, we end up with one line instead of two. This isn't an issue as we aren't actually missing anything. But we are missing that line when comparing the roundtrip.: This is not an issue. In fact it is an improvement, because the extra line is edge redundancy without any added provenance, which makes graph traversal in neo4j unnecessarily more expensive. For edges that appear more than once with different references, I checked reference_uri and the values are concatenated correctly using "|".
  • reference_date values are all blank because Greg's script ignored that info in the input file. I am also going to ignore this column when uploading into neo4j, and will update my lib.
  • exactMatch properties have the URI in the input file (and no wikidata URI), which is ok.
  • There are some missing edge_types:
    • None, as expected, since we agreed to ignore these edges.
    • colocalizes_with and contributes_to (NuriaQueralt/ngly1-graph#3) are not named, but use the URI: the issue is between "colocalizes with" / "contributes to" and "colocalizes_with" / "contributes_to". Some have an underscore instead of a space and are not converted into the correct edge_type. My library is going to fix that.
MATCH ()-[r]->()
RETURN DISTINCT type(r),
count(*) AS NumberOfRelationships, reduce(keys = [], keys_r in collect(keys(r)) | keys + filter(k in keys_r WHERE NOT k IN keys)) as EntityAttributes
ORDER BY NumberOfRelationships DESC
  • 4240 lines missed: according to @stuppie's analysis these are 1) items with too many statements (>1000), which are fine to ignore for now; 2) rdf:type edges, although these seem to be there to me when counting type(edge) in neo4j; 3) colocalizes_with and contributes_to, which are already discussed; and 4) edges collapsed due to provenance redundancy, which is fine.
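
As a rough idea of the underscore/space fix mentioned above, here is a Python sketch that maps both spellings to one edge type before counting. Choosing the underscore form as canonical is an assumption, and `normalize_edge_type` is a hypothetical helper, not the library's actual function.

```python
from collections import Counter


def normalize_edge_type(label):
    """Map "colocalizes with" and "colocalizes_with" to the same edge type.

    Assumption: the underscore form is the canonical one.
    """
    return "_".join(label.strip().split())


def count_edge_types(labels):
    """Count edges per normalized edge type."""
    return Counter(normalize_edge_type(label) for label in labels)


# Hypothetical property labels as they might appear in the two CSV files.
labels = ["colocalizes with", "colocalizes_with", "contributes to", "contributes_to"]

edge_type_counts = count_edge_types(labels)
for edge_type, n in sorted(edge_type_counts.items()):
    print(edge_type, n)
```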

To change in edges file @stuppie

  • Nothing :)
