Add ability to load, partition, and separate node labels with features#276

Merged
mkolodner-sc merged 31 commits into main from mkolodner-sc/update_tfrecordloader_with_labels on Aug 19, 2025

@mkolodner-sc mkolodner-sc commented Aug 15, 2025

Scope of work done

  • Adds the ability for TFRecordDataloader to load labels as part of the features, with corresponding tests
  • Adds a function for separating labels from features, with corresponding tests
  • Updates the dataset factory to call this function to separate node labels if they are present
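
As a rough illustration of the separation step, here is a minimal sketch in plain Python. It assumes the labels occupy the trailing columns of each combined row, which matches the description of labels being appended to the feature tensor. The name `split_trailing_labels`, its signature, and the list-based rows are hypothetical; the function in this PR operates on tensors and may differ.

```python
from typing import List, Optional, Tuple

Row = List[float]


def split_trailing_labels(
    rows: List[Row], label_dim: int
) -> Tuple[List[Row], Optional[List[Row]]]:
    """Split each row into (features, labels).

    Assumption: the last `label_dim` columns of every row are labels.
    When `label_dim` is 0 the rows are returned unchanged with no labels.
    """
    if label_dim < 0:
        raise ValueError("label_dim must be non-negative")
    if label_dim == 0:
        return rows, None
    for row in rows:
        if len(row) < label_dim:
            raise ValueError("row is shorter than label_dim")
    features = [row[:-label_dim] for row in rows]
    labels = [row[-label_dim:] for row in rows]
    return features, labels


# Two nodes, three feature columns each, plus one trailing label column.
combined = [
    [0.1, 0.2, 0.3, 2.0],
    [0.4, 0.5, 0.6, 0.0],
]
features, labels = split_trailing_labels(combined, label_dim=1)
```

Returning `None` for the labels when no label columns exist lets a caller skip label handling entirely for datasets without labels.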

Explanation of Node Classification Labels

  1. As input to our node classification pipeline, we should have prepared BigQuery node feature tables that include a column for the node classification label.

Example taken from the table external-snap-ci-github-gigl.public_gigl.cora_homogeneous_supervised_node_classification_edge_features_paper_nodes_2024-07-15--21-30-07-UTC:
[Screenshot of the table omitted]

  2. In our Data Preprocessor config, labels are specified [1] in the get_nodes_preprocessing_spec function.
  3. After running the Data Preprocessor, we have generated TFRecords that contain the node label as a feature. The preprocessed metadata output at gs://public-gigl/mocked_assets/2024-07-15--21-30-07-UTC/cora_homogeneous_supervised_node_classification/data_preprocess/preprocessed_metadata.yaml shows this:
...
    - f1430
    - f1431
    - f1432
    labelKeys:
    - node_label
    nodeIdKey: node_id
    schemaUri: gs://public-gigl/mocked_assets/2024-07-15--21-30-07-UTC/cora_homogeneous_supervised_node_classification/data_preprocess/node_features_dir/paper/schema.pbtxt
    tfrecordUriPrefix: gs://public-gigl/mocked_assets/2024-07-15--21-30-07-UTC/cora_homogeneous_supervised_node_classification/data_preprocess/node_features_dir/paper/features/
  4. We load the preprocessed metadata into a PreprocessedMetadataPbWrapper. This extracts the fields we need for loading the TFRecords, such as the FeatureSchemaDict, FeatureDim, and LabelKeys. As of this PR, the FeatureSchemaDict includes the feature schemas for both the feature keys and the label keys.
  5. We pass this information to the SerializedTFRecordInfo.
  6. We load the node features and labels, with the labels appended as an additional feature to the normal feature tensor.
  7. We call our partitioner and partition the "node features" across all machines. Including the labels in this step has two benefits: 1) there are fewer distributed calls, and 2) the mapping of global_id to local_id (id2idx) is guaranteed to be the same for node features and node labels.
  8. We extract the labels from the partitioned node features with _get_labels_from_features.
  9. We register those labels to the dataset in dataset.build() (TODO in the next PR).
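
The alignment benefit of partitioning features and labels together can be sketched with a toy example. This is an illustration only: the modulo partitioning rule, the helper `partition_rows`, and the list-based rows are all hypothetical and do not reflect how the GiGL partitioner actually assigns nodes. The point is that when each row carries both its features and its label, a single id2idx mapping per partition indexes both after the split.

```python
from typing import Dict, List, Tuple

Row = List[float]


def partition_rows(
    node_ids: List[int], rows: List[Row], num_partitions: int
) -> Dict[int, Tuple[List[int], List[Row]]]:
    """Assign each (node_id, row) pair to partition `node_id % num_partitions`.

    The modulo rule is a stand-in for a real partitioner (assumption).
    """
    parts: Dict[int, Tuple[List[int], List[Row]]] = {
        p: ([], []) for p in range(num_partitions)
    }
    for node_id, row in zip(node_ids, rows):
        ids, part_rows = parts[node_id % num_partitions]
        ids.append(node_id)
        part_rows.append(row)
    return parts


node_ids = [0, 1, 2, 3]
# Three feature columns followed by one label column per node.
rows = [
    [0.0, 0.1, 0.2, 1.0],
    [1.0, 1.1, 1.2, 0.0],
    [2.0, 2.1, 2.2, 2.0],
    [3.0, 3.1, 3.2, 1.0],
]

# One partitioning pass moves features and labels together.
parts = partition_rows(node_ids, rows, num_partitions=2)

per_partition = {}
for p, (ids, part_rows) in parts.items():
    # A single global-id -> local-index mapping per partition.
    id2idx = {node_id: idx for idx, node_id in enumerate(ids)}
    # Separate the trailing label column only after partitioning.
    features = [r[:-1] for r in part_rows]
    labels = [r[-1] for r in part_rows]
    per_partition[p] = (id2idx, features, labels)

# Look up node 3 on partition 1: the same local index serves both lists.
id2idx, features, labels = per_partition[1]
local_idx = id2idx[3]
node3_features = features[local_idx]
node3_label = labels[local_idx]
```

Had features and labels been partitioned in two separate passes, each pass would produce its own ordering, and the two would have to be reconciled before labels could be looked up by local index.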

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@mkolodner-sc mkolodner-sc changed the title Update TFRecordLoader to Load Labels Add ability to load, partition, and separate labels with features Aug 15, 2025
@mkolodner-sc mkolodner-sc changed the title Add ability to load, partition, and separate labels with features Add ability to load, partition, and separate node labels with features Aug 15, 2025

@yliu2-sc yliu2-sc left a comment

I'm OK with this implementation. My only nit is whether we can store labels separately from features.


@svij-sc svij-sc left a comment

Looks okay to me.

@mkolodner-sc
Collaborator Author

/unit_test

@mkolodner-sc
Collaborator Author

/e2e_test

github-actions Bot commented Aug 19, 2025

GiGL Automation

@ 06:44:04UTC : 🔄 Unit Test started.

@ 07:20:10UTC : ❌ Workflow failed.
Please check the logs for more details.

github-actions Bot commented Aug 19, 2025

GiGL Automation

@ 06:44:05UTC : 🔄 E2E Test started.

@ 08:08:05UTC : ✅ Workflow completed successfully.

@mkolodner-sc
Collaborator Author

/unit_test

github-actions Bot commented Aug 19, 2025

GiGL Automation

@ 08:16:53UTC : 🔄 Unit Test started.

@ 08:53:30UTC : ✅ Workflow completed successfully.

@mkolodner-sc mkolodner-sc marked this pull request as ready for review August 19, 2025 16:38
@mkolodner-sc mkolodner-sc requested a review from nshah-sc as a code owner August 19, 2025 16:38
@mkolodner-sc mkolodner-sc added this pull request to the merge queue Aug 19, 2025
@mkolodner-sc mkolodner-sc removed this pull request from the merge queue due to a manual request Aug 19, 2025
@mkolodner-sc mkolodner-sc added this pull request to the merge queue Aug 19, 2025
Merged via the queue into main with commit 04b3b7a Aug 19, 2025
5 checks passed
@mkolodner-sc mkolodner-sc deleted the mkolodner-sc/update_tfrecordloader_with_labels branch August 19, 2025 21:46