Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(pointcloud): Add HDBSCAN to point cloud #57

Merged
merged 35 commits into from
Nov 28, 2022
Merged

Conversation

fjcasti1
Copy link
Contributor

@fjcasti1 fjcasti1 commented Nov 22, 2022

resolves #46

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@fjcasti1 fjcasti1 marked this pull request as ready for review November 22, 2022 22:42
src/phoenix/__init__.py Show resolved Hide resolved
src/phoenix/pcloud/pcloud.py Outdated Show resolved Hide resolved
src/phoenix/pcloud/pcloud.py Outdated Show resolved Hide resolved
float(point.y),
float(point.z),
]
"clusterId": point.cluster_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to propagate point metadata to the UI by populating a metaData field that is a dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean each point will have a metadata field the same as position?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on this with #63


@dataclass(frozen=True)
class UMAPProjector:
hyperparameters: Dict[str, Union[int, float, str]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make private - otherwise the programmer may think you can modify directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on providing a reasonable default for the constructor?

Copy link
Contributor Author

@fjcasti1 fjcasti1 Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if we want to pass defaults and/or run proper validation, we should create a separate class UMAPHyperparameter, and put defaults and validations in it. It would be much more transparent and easier. The downside is then we would have to implement all the options of the UMAP class, which I think is not ideal.

The reason I went with this approach is that we don't need default values. When we call UMAP we use
UMAP(**hyperparameters). That command expands the dictionary setting the values specified and leaving the rest as default. So, our defaults are UMAP's defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About making it private -- It is already done, by using dataclasses with frozen=true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About making it private -- It is already done, by using dataclasses with frozen=true
Nice

As for the type - the problem from a DX perspective is that there now is no type hints for these values - making this projector harder to use as I don't have documentation on the parameters. pyright actually catches the boundary issues - not 100% sure how. Thoughts on just making a strict type for the parameters we think need tuning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to create another class for the hyperparameters and have a private method that turns the class into a dictionary. To UMAP eyes it will be the same. To our users, it will be mostly the same and we can address new hyperparameters as they are requested. Or directly setting them all up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up in #64

src/phoenix/pcloud/pcloud.py Outdated Show resolved Hide resolved
Comment on lines 25 to 30
self.prediction_label = prediction_label
# self.prediction_score = prediction_score
self.actual_label = actual_label
# self.actual_score = actual_score
self.raw_text_data = raw_text_data
# self.link_to_data = link_to_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than having flat attributes, it might be good to structure these constructs a bit in a class or dict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

particularly since feature data will need to be serialized - Unlike with the arize platform, all data needs to be serialized in bulk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think it deserves a little conversation though. Don't know what direction to go particularly. I am tempted to call it metadata class since it will go in the metadata part of the json. But not too in love with that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inference or Event attributes seems appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up in #65

src/phoenix/pcloud/pcloud.py Outdated Show resolved Hide resolved
src/phoenix/pcloud/pcloud.py Outdated Show resolved Hide resolved
src/phoenix/pcloud/projectors.py Outdated Show resolved Hide resolved
Francisco Castillo and others added 11 commits November 22, 2022 15:19
Co-authored-by: Mikyo King <mikyo@arize.com>
Co-authored-by: Mikyo King <mikyo@arize.com>
…eat/add-hdbscan

* 'feat/add-hdbscan' of github.com:Arize-ai/phoenix:
  Update src/phoenix/pcloud/projectors.py
  Update src/phoenix/pcloud/pcloud.py
Co-authored-by: Mikyo King <mikyo@arize.com>
Co-authored-by: Mikyo King <mikyo@arize.com>
mikeldking and others added 10 commits November 22, 2022 21:27
…eat/add-hdbscan

* 'feat/add-hdbscan' of github.com:Arize-ai/phoenix:
  black
  more fixes
  fix types:
  Update src/phoenix/pcloud/projectors.py
  Update src/phoenix/pcloud/pcloud.py
  Delete _fit_transform method
  Fix json int bug
  Fix style
  Add HDBSCAN clustering
  Restore umap_demo deleted
  fix: python types (#55)
@fjcasti1 fjcasti1 changed the title feat: Add HDBSCAN to point cloud Add HDBSCAN to point cloud Nov 25, 2022
@fjcasti1 fjcasti1 changed the title Add HDBSCAN to point cloud feat(pointcloud): Add HDBSCAN to point cloud Nov 25, 2022
y: float

def get_coordinates(self):
return [float(self.x), float(self.y)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit confused on the casting to float here - is it for type safety?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because the default coordinates spit from UMAP and HDBSCAN have been float32 and, if left like that, you get the error that is not JSON serializable. I am open to other solutions.

Co-authored-by: Mikyo King <mikyo@arize.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add HDBSCAN clustering to point cloud computation
2 participants