NetworkX shape passed value error and how can I help? #201

Closed
fils opened this issue Oct 8, 2021 · 11 comments

@fils
Contributor

fils commented Oct 8, 2021

@ceteri
Paco,
So I have some time to spend working with some schema.org based data from Hydroshare and using kglab to explore it. I'm having issues applying it, and I hope that in resolving them I might be able to help somehow with the docs and such.

Hopefully this isn't just me being stupid in graph space, but is of some help back to the project. Happy to share.

Working with the same data from Issue 24 and now trying the NetworkX area I got a specific error.

So this code:

import kglab
import networkx as nx

sparql3 = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?object
WHERE { 
  ?subject a <https://schema.org/Dataset> .
  ?subject <https://schema.org/creator> ?creator .
  ?creator rdf:first ?o .
  ?o <https://schema.org/name> ?object
}
  """

# `kg` is the kglab.KnowledgeGraph loaded earlier in the notebook
subgraph = kglab.SubgraphMatrix(kg, sparql3)
nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

results in this error

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3293779/2456468873.py in <module>
     13 
     14 subgraph = kglab.SubgraphMatrix(kg, sparql3)
---> 15 nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_nx_graph(self, nx_graph, bipartite)
    250         """
    251         if self.kg.use_gpus:
--> 252             df = self.build_df()
    253             nx_graph.from_cudf_edgelist(df, source="src", destination="dst")
    254         else:

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_df(self, show_symbols)
    223 
    224         if self.kg.use_gpus:
--> 225             df = cudf.DataFrame(rows_list, columns=col_names)
    226         else:
    227             df = pd.DataFrame(rows_list, columns=col_names)

~/.conda/envs/kglab/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    257                     )
    258                 else:
--> 259                     self._init_from_list_like(
    260                         data, index=index, columns=columns
    261                     )

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    397         if columns is not None:
    398             if len(columns) != len(data):
--> 399                 raise ValueError(
    400                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    401                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (5293, 5293), indices imply (5293, 2).

The results of that SPARQL on the graph should be like:

subject,object
https://www.hydroshare.org/resource/aefabd0a6d7d47ebaa32e2fb293c9f8a#schemaorg,Courtney G Flint
https://www.hydroshare.org/resource/f94ac7f8d8a048cdbd2610dfa7cd315b#schemaorg,Zhiyu (Drew) Li
https://www.hydroshare.org/resource/f9a75c0b289649aa844e84c24f9f5780#schemaorg,Young-Don Choi
https://www.hydroshare.org/resource/173875a936f14c22a5ba19c721adfb86#schemaorg,Remi Dupas
https://www.hydroshare.org/resource/f1116211202a4c069919797272023e62#schemaorg,Nathan Swain
https://www.hydroshare.org/resource/6d80e4bd00244b5dabaff34074cd3102#schemaorg,Garrick Stephenson
https://www.hydroshare.org/resource/25133b13a1fc4fca9187c2d4e272d4e8#schemaorg,Jessie Myers
https://www.hydroshare.org/resource/ca0f2f0f28ba40018ae64b973e2bb35a#schemaorg,Ruth B. MacNeille
https://doi.org/10.4211/hs.88454dae8c604009b684bfa136e5f7f4#schemaorg,Celray James CHAWANDA
https://doi.org/10.4211/hs.1c6034be6886412ba59970ab1157fa7e#schemaorg,Bethany Neilson
(and so on, for 5293 lines)
@Mec-iS
Contributor

Mec-iS commented Nov 29, 2021

From the error message I can see you are running on GPUs. Do you get the same error running on plain CPUs?
It looks like it is complaining about the shape of the dataframe; I will look into it.
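
A minimal sketch of why the shape check in the traceback fails, in plain Python with no cudf required. The function `check_list_like_shape` is a hypothetical stand-in that only mirrors the `len(columns) != len(data)` comparison visible in the traceback, and the sample rows are made up. It assumes, without this having been verified against the cudf version in use, that the list passed in was being read as one entry per column rather than one entry per row.

```python
def check_list_like_shape(data, columns):
    """Hypothetical stand-in mirroring the comparison shown in the traceback."""
    if len(columns) != len(data):
        raise ValueError(
            f"got {len(data)} list entries but {len(columns)} column names"
        )


# Made-up sample in the same layout as the SPARQL results: one tuple per row.
rows_list = [
    ("https://example.org/d1", "Alice"),
    ("https://example.org/d2", "Bob"),
    ("https://example.org/d3", "Carol"),
]
col_names = ["subject", "object"]

# Row-wise data: 3 entries vs. 2 column names, so the check raises.
try:
    check_list_like_shape(rows_list, col_names)
    row_major_ok = True
except ValueError as err:
    row_major_ok = False
    print("row-wise:", err)

# Transposing gives one list per column: 2 entries vs. 2 names, so it passes.
columns_list = [list(col) for col in zip(*rows_list)]
check_list_like_shape(columns_list, col_names)
data_by_column = dict(zip(col_names, columns_list))
print(data_by_column["object"])
```

Whether passing column-wise data is the right fix inside kglab is not established in this thread; the sketch only illustrates the mismatch that the error message reports.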

@Mec-iS
Contributor

Mec-iS commented Nov 29, 2021

I have collected all your code from issue #200 and this one into this Colab notebook, which is also available as this GitHub gist.

If you have a Google account you can open and run it remotely (on a free tier). The code works there and you can check if the results returned are the ones expected by copying and modifying the notebook, let me know how it goes!

Please provide your local system's specs (OS, whether you are using a virtual environment, how you installed your packages) so I can understand the problem you have running on your machine.
The suggested way to keep control of the installed packages is to use a Python virtual environment. I also find it really useful to use the Jupyter Lab desktop application to manage package installation.
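
One way to gather the details requested above is a short script run inside the same environment as the notebook. This is only a sketch: the function name `collect_env_report` and the default package list are assumptions, and conda-installed packages may not always expose metadata under these names.

```python
import os
import platform
import sys
from importlib import metadata


def collect_env_report(packages=("kglab", "networkx", "pandas", "cudf")):
    """Collect OS, Python, environment, and package version details."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        # True only for venv/virtualenv; conda environments are reported below.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report["packages"][name] = None
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting the printed output into the issue gives the OS, Python version, and package versions in one place.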

@fils
Contributor Author

fils commented Dec 7, 2021

@Mec-iS Thanks for your help on these two items.

The SPARQL does work in Colab so I suspect this is an issue with my GPU usage as you point out.

Is there a simple way to turn off GPU leveraging in a notebook? I honestly don't know how I would disable that for a specific case.

@Mec-iS
Contributor

Mec-iS commented Dec 7, 2021

sure. no problem.

The instantiation of the graph has an option to disable GPUs:

kg = kglab.KnowledgeGraph(
    name = "hydroshare",
    namespaces = ns,
    use_gpus=False
    )

@fils
Contributor Author

fils commented Dec 7, 2021

@Mec-iS Thanks!

Turning off the GPU removed the error. Thanks much (now I need to resolve my GPU issue, but that can wait for another time).

I'm not well disciplined in my Python usage. I am using conda to set up my environment, and you can find it here if you are interested in digging further into this issue (https://github.com/gleanerio/notebooks/blob/master/environment.yml), but it's a bit large (see the remark above about lack of discipline).

You've resolved this problem though so feel free to close.
I'll post new problems in new issues.. ;)

@ceteri
Collaborator

ceteri commented Dec 9, 2021

Hi @fils ,

Which dataset are you using?
It's not the one on #24, but another?

I can try to recreate the issue on my Linux laptop which has an NVIDIA GPU.

It may be that some underlying dependencies for RAPIDS have changed. They have a somewhat non-standard "release selector" which we haven't updated in several months: https://rapids.ai/start.html

@Mec-iS
Contributor

Mec-iS commented Dec 9, 2021

The dataset is the one in #200; it can be downloaded from S3.

@fils
Contributor Author

fils commented Dec 9, 2021

@ceteri @Mec-iS

I pushed up some of what I am working on (including the graphs) to https://github.com/gleanerio/notebooks/tree/master/Hydroshare

As noted, this should be the same graph as at the S3 (updated with prefix for schema.org).

I think the issue just may be the graph and the way I am approaching it not being the best. So I think things are working fine (sans the GPU issue I have.. which could be my install.. driver 495.29.05 by the way on a GTX 1050 Ti, nothing too special).

I'm trying to work up some ways people can inspect the schema.org based graphs around their datasets that come from implementing the https://github.com/ESIPFed/science-on-schema.org/ guidance. So any course corrections or guidance would be more than welcome!

Thanks for your engagement with this..

@charlesvardeman

So @ceteri, I think that you are correct on the RAPIDS release selector. We have RAPIDS installed on a development node of our GPU cluster using the following selector:
conda create -n rapids-21.12 -c rapidsai -c nvidia -c conda-forge \
    cudf=21.12 cuml=21.12 cugraph=21.12 python=3.8 cudatoolkit=11.2

Running the example from the tutorial:

import kglab

namespaces = {
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
    )

and then calling kg.describe_ns() produces an error message similar to what @fils was seeing.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2396895/1517367763.py in <module>
----> 1 kg.describe_ns()

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/kglab/kglab.py in describe_ns(self)
    254 
    255         if self.use_gpus:
--> 256             df = cudf.DataFrame(rows_list, columns=col_names)
    257         else:
    258             df = pd.DataFrame(rows_list, columns=col_names)

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    610                     )
    611                 else:
--> 612                     self._init_from_list_like(
    613                         data, index=index, columns=columns
    614                     )

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    750         if columns is not None:
    751             if len(columns) != len(data):
--> 752                 raise ValueError(
    753                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    754                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (31, 31), indices imply (31, 2).

The machine details are:

Wed Feb 16 14:53:35 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:00:09.0 Off |                    0 |
| N/A   18C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     Off  | 00000000:00:0A.0 Off |                    0 |
| N/A   21C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   21C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   20C    P8    12W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The node is running Red Hat Enterprise Linux release 8.5 (Ootpa), Python 3.8.12 (default, Oct 12 2021, 13:49:34)
[GCC 7.5.0]

@Mec-iS
Contributor

Mec-iS commented Mar 8, 2022

@charlesvardeman please move your comment to the RAPIDS related discussion or open a new issue.

I am closing this as resolved.

Mec-iS closed this as completed Mar 8, 2022
@ceteri
Collaborator

ceteri commented Mar 8, 2022

@charlesvardeman @Mec-iS @fils:

I've opened another issue, #229, specifically to track the updates we need to make to support RAPIDS.
