feat: Allow embeddings reads from csv file format #71

fjcasti1 · 2022-11-29T08:49:36Z

Closes #22, #61

Remove dataclass decorator from Dataset class
Add validation module
Add dataset validation on creation
Make prediction_id optional
Enable embedding reads from csv
Update example

mikeldking · 2022-11-29T18:34:49Z

src/phoenix/validation/errors.py

+class MissingVectorColumn(ValidationError):
+    def __init__(self, col: str) -> None:
+        self.missing_col = col
+
+    def error_message(self) -> str:
+        return (
+            f"The embedding vector column {self.missing_col} is declared in the schema "
+            "but is not found in the dataframe."
+        )


I'm not sure subclassing for every case makes sense here. It will result in too many error types. Can you just use the DatasetError below?

Also, you are leaking internals by using "dataframe" in the message. It's not a huge deal, but the caller doesn't need to know that we are using dataframes.

> I'm not sure subclassing for every case makes sense here. It will result in too many error types. Can you just use the DatasetError below?

See if you like it better now. There are specific errors for each situation. What I like about this approach is that the message stays consistent, instead of a developer changing the message 2 months from now to say the same thing.

> Also, you are leaking internals by using "dataframe" in the message. It's not a huge deal, but the caller doesn't need to know that we are using dataframes.

Would you like it better if it said "not found in data"? Aside from this, most of our code has column_name(s) at the end, which gives a pretty good indication that we are using tables. Just raising it in case you would like me to change that in a follow-up PR.

Here are 3 example errors:

mikeldking

Thanks! Much cleaner.
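
For readers following the thread above, here is a minimal sketch of the one-class-per-situation approach that was discussed. It is not the merged code: the abstract `ValidationError` base and its `__str__` are assumptions made for illustration, and the message uses the "not found in the data" wording floated in the reply rather than whatever wording was finally committed.

```python
from abc import ABC, abstractmethod


class ValidationError(ABC):
    """Assumed base class: each concrete error owns its message text."""

    @abstractmethod
    def error_message(self) -> str:
        """Return the user-facing description of the problem."""

    def __str__(self) -> str:
        return self.error_message()


class MissingVectorColumn(ValidationError):
    """A schema names an embedding vector column that the data lacks."""

    def __init__(self, col: str) -> None:
        self.missing_col = col

    def error_message(self) -> str:
        # Says "data" instead of "dataframe" so internals are not exposed.
        return (
            f"The embedding vector column {self.missing_col} is declared in the schema "
            "but is not found in the data."
        )


error = MissingVectorColumn("embedding_vector")
print(error)
```

Because the message lives inside the class, every call site that reports this situation produces the same text, which is the consistency argument made in the reply.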

mikeldking · 2022-11-30T00:26:22Z

src/phoenix/datasets/dataset.py


    @classmethod
    def from_dataframe(cls, dataframe: DataFrame, schema: Schema):
        return cls(dataframe, schema)

    @classmethod
    def from_csv(cls, filepath: str, schema: Schema):
-        return cls(read_csv(filepath), schema)
+        dataframe: DataFrame = read_csv(filepath)


Mind adding a small unit test here? It might be a good excuse to split the parsing out from the file system read.
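
One way to read this suggestion, sketched under assumptions: keep a pure parsing function that accepts any text stream, and a thin wrapper that is the only place a file is opened. The PR itself uses `read_csv` on a `DataFrame`; the sketch below swaps in the standard-library `csv` module so it runs without dependencies. The names `parse_embedding_cell`, `parse_csv`, and `read_csv_file` are hypothetical, as is the assumption that vectors are stored as JSON-style lists in a cell.

```python
import csv
import io
import json
from typing import Dict, Iterable, List, TextIO


def parse_embedding_cell(raw: str) -> List[float]:
    """Turn a CSV cell such as "[0.1, 0.2]" into a list of floats."""
    values = json.loads(raw)
    if not isinstance(values, list):
        raise ValueError(f"expected a list of numbers, got: {raw!r}")
    return [float(v) for v in values]


def parse_csv(stream: TextIO, vector_columns: Iterable[str]) -> List[Dict[str, object]]:
    """Parse CSV text from any file-like object; no file system access here."""
    vector_column_set = set(vector_columns)
    rows: List[Dict[str, object]] = []
    for record in csv.DictReader(stream):
        row: Dict[str, object] = dict(record)
        for col in vector_column_set:
            row[col] = parse_embedding_cell(record[col])
        rows.append(row)
    return rows


def read_csv_file(filepath: str, vector_columns: Iterable[str]) -> List[Dict[str, object]]:
    """Thin wrapper: the only function that touches the file system."""
    with open(filepath, newline="") as f:
        return parse_csv(f, vector_columns)


# The parser can be exercised with an in-memory stream, no temp files needed.
sample = io.StringIO('prediction_id,embedding_vector\nabc,"[0.1, 0.2, 0.3]"\n')
rows = parse_csv(sample, vector_columns=["embedding_vector"])
assert rows[0]["embedding_vector"] == [0.1, 0.2, 0.3]
```

With this split, a unit test only needs an `io.StringIO`, and the file-reading wrapper stays small enough that it needs little testing of its own.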

mikeldking · 2022-11-30T00:28:03Z

src/phoenix/datasets/validation.py

+from .types import Schema
+
+
+def validate_dataset_inputs(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:


Good opportunity for unit tests!

Following up in #73
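
As a rough idea of what such tests could look like, here is a dependency-free sketch. It does not reproduce `validate_dataset_inputs`: `SimpleSchema`, `MissingColumns`, and `validate_column_names` are hypothetical stand-ins that work on plain column names instead of a `DataFrame`, and the schema field names are guesses that follow the `column_name(s)` suffix mentioned earlier in the review. The only behavior borrowed from the PR is that the validator returns a list of errors and that `prediction_id` is optional.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SimpleSchema:
    """Hypothetical, trimmed-down stand-in for the real Schema type."""

    prediction_id_column_name: Optional[str] = None
    embedding_vector_column_names: Tuple[str, ...] = ()


@dataclass(frozen=True)
class MissingColumns:
    """Hypothetical error record: schema columns absent from the data."""

    missing_cols: Tuple[str, ...]

    def error_message(self) -> str:
        return (
            "Columns declared in the schema but not found in the data: "
            + ", ".join(self.missing_cols)
        )


def validate_column_names(
    column_names: Sequence[str], schema: SimpleSchema
) -> List[MissingColumns]:
    """Return a list of errors; an empty list means the inputs are valid."""
    declared = list(schema.embedding_vector_column_names)
    if schema.prediction_id_column_name is not None:
        # prediction_id is optional, so it is only checked when declared.
        declared.insert(0, schema.prediction_id_column_name)
    present = set(column_names)
    missing = tuple(col for col in declared if col not in present)
    return [MissingColumns(missing)] if missing else []


def test_valid_inputs_produce_no_errors() -> None:
    schema = SimpleSchema(embedding_vector_column_names=("embedding_vector",))
    assert validate_column_names(["embedding_vector", "label"], schema) == []


def test_missing_vector_column_is_reported() -> None:
    schema = SimpleSchema(
        prediction_id_column_name="prediction_id",
        embedding_vector_column_names=("embedding_vector",),
    )
    errors = validate_column_names(["prediction_id"], schema)
    assert len(errors) == 1
    assert errors[0].missing_cols == ("embedding_vector",)


test_valid_inputs_produce_no_errors()
test_missing_vector_column_is_reported()
```

Returning a list rather than raising on the first problem lets a test assert on exactly which errors were produced, and lets callers report all problems at once.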

Francisco Castillo added 6 commits November 28, 2022 22:31

Remove dataclass decorator from Dataset class

08a3c60

Add validation module

7a4d55c

Add dataset validation on creation

1e8f4ef

Make prediction_id optional

babb4d9

Enable embedding reads from csv

c1e2c31

Update example

20010d8

Fix type check

d6c0829

fjcasti1 marked this pull request as ready for review November 29, 2022 08:57

fjcasti1 self-assigned this Nov 29, 2022

fjcasti1 linked an issue Nov 29, 2022 that may be closed by this pull request

[metrics] CSV parsing for embeddings #61

Closed

3 tasks

fjcasti1 requested review from mikeldking and davidgmonical November 29, 2022 08:59

mikeldking reviewed Nov 29, 2022

Francisco Castillo added 4 commits November 29, 2022 14:58

Make dataset validation self-contained

ed61140

Remove DatasetValidator class

93428b0

Restructure errors

8f7c1c1

Add docstrings to errors

15f879e

mikeldking reviewed Nov 30, 2022

Update src/phoenix/datasets/dataset.py

bf3ab99

Co-authored-by: Mikyo King <mikyo@arize.com>

mikeldking approved these changes Nov 30, 2022

Francisco Castillo and others added 6 commits November 29, 2022 16:37

wip

2f6479f

Update src/phoenix/datasets/errors.py

f15ed6f

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

78e4875

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

9ace8cb

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

e554884

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

574d33e

Co-authored-by: Mikyo King <mikyo@arize.com>

Francisco Castillo and others added 4 commits November 29, 2022 16:40

Update src/phoenix/datasets/errors.py

6df2e93

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

173bef8

Co-authored-by: Mikyo King <mikyo@arize.com>

Update src/phoenix/datasets/errors.py

4591573

Co-authored-by: Mikyo King <mikyo@arize.com>

Remove unnecessary underscore

03d7418

fjcasti1 mentioned this pull request Nov 30, 2022

Add unit tests for validate_dataset_inputs #73

Closed

fjcasti1 merged commit 183c63a into main Nov 30, 2022

fjcasti1 deleted the read-embeddings-from-csv branch November 30, 2022 01:20

fjcasti1 pushed a commit that referenced this pull request Nov 30, 2022

Merge branch 'main' into add-inference-attributes

f76f20e

* main: feat: Allow embeddings reads from csv file format (#71)

fjcasti1 mentioned this pull request Nov 30, 2022

Decouple read_csv from (embedding) file sanitation. #74

Closed
