feat: Allow embeddings reads from csv file format #71

Merged: 22 commits merged into main from read-embeddings-from-csv on Nov 30, 2022

@fjcasti1 fjcasti1 commented Nov 29, 2022

Closes #22, #61

  • Remove dataclass decorator from Dataset class
  • Add validation module
  • Add dataset validation on creation
  • Make prediction_id optional
  • Enable embedding reads from csv
  • Update example

@fjcasti1 fjcasti1 marked this pull request as ready for review November 29, 2022 08:57
@fjcasti1 fjcasti1 self-assigned this Nov 29, 2022
@fjcasti1 fjcasti1 linked an issue Nov 29, 2022 that may be closed by this pull request
3 tasks
src/phoenix/validation/errors.py (outdated)
Comment on lines 22 to 30

    class MissingVectorColumn(ValidationError):
        def __init__(self, col: str) -> None:
            self.missing_col = col

        def error_message(self) -> str:
            return (
                f"The embedding vector column {self.missing_col} is declared in the schema "
                "but is not found in the dataframe."
            )

Contributor

I'm not sure sub-classing for every case makes sense here - this will result in too many error types - can you just use the DatasetError below?

Also, you are leaking internals by using "dataframe" in the message - it's not a huge deal, but the caller doesn't need to know that we are using dataframes.

Contributor Author

> I'm not sure sub-classing for every case makes sense here - this will result in too many error types - can you just use the DatasetError below?

See if you like it better now. There are specific errors for each situation. What I like about this approach is that the message stays consistent, instead of a developer changing the message 2 months from now to say the same thing.

> Also, you are leaking internals by using "dataframe" in the message - it's not a huge deal, but the caller doesn't need to know that we are using dataframes.

Would you like it better if it said "not found in data"? Besides this, most of our code has column_name(s) at the end of its parameter names, which gives a pretty good indication that we are using tables. Just raising it in case you would like me to change that in a follow-up PR.
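
For readers following the thread, the trade-off discussed above can be sketched roughly as follows. This is only an illustration, not the code that was merged: the shape of `ValidationError` and `DatasetError` is an assumption (the thread names them but does not show them in full), and the message uses the "not found in the data" wording proposed above.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class ValidationError(ABC):
    """Assumed base class: each subclass owns the wording of one message."""

    @abstractmethod
    def error_message(self) -> str:
        ...


class MissingVectorColumn(ValidationError):
    def __init__(self, col: str) -> None:
        self.missing_col = col

    def error_message(self) -> str:
        # Says "data" rather than "dataframe" so the storage detail is not leaked.
        return (
            f"The embedding vector column {self.missing_col} is declared in the schema "
            "but is not found in the data."
        )


class DatasetError(Exception):
    """Hypothetical wrapper that reports every validation failure at once."""

    def __init__(self, errors: Iterable[ValidationError]) -> None:
        self.errors: List[ValidationError] = list(errors)
        super().__init__("\n".join(e.error_message() for e in self.errors))
```

With this layout callers only ever catch one exception type, while each message is still defined in exactly one place.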

Contributor Author

Here are 3 example errors:

[3 screenshots of the example error messages, taken 2022-11-29; images not preserved]

Co-authored-by: Mikyo King <mikyo@arize.com>
Contributor

@mikeldking mikeldking left a comment

Thanks! Much cleaner.


    @classmethod
    def from_dataframe(cls, dataframe: DataFrame, schema: Schema):
        return cls(dataframe, schema)

    @classmethod
    def from_csv(cls, filepath: str, schema: Schema):
-        return cls(read_csv(filepath), schema)
+        dataframe: DataFrame = read_csv(filepath)
Contributor

Mind adding a small unit test here? It might be a good excuse to split the parsing out from the file system read.
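
One way to read the suggestion above is sketched below. It is a minimal sketch under several assumptions: it uses the standard-library `csv` and `json` modules instead of pandas to stay self-contained, it assumes embeddings are stored in a cell as a JSON-style list such as `"[0.1, 0.2]"`, and the names `parse_embedding`, `parse_csv_text` and `read_csv_file` are hypothetical, not taken from the PR.

```python
import csv
import io
import json
from typing import Dict, Iterable, List, Sequence


def parse_embedding(cell: str) -> List[float]:
    """Turn a cell such as "[0.1, 0.2]" into a list of floats."""
    values = json.loads(cell)
    if not isinstance(values, list):
        raise ValueError(f"expected a JSON list, got: {cell!r}")
    return [float(v) for v in values]


def parse_csv_text(
    lines: Iterable[str], embedding_columns: Sequence[str]
) -> List[Dict[str, object]]:
    """Pure parsing step: needs no file system, so it is easy to unit test."""
    rows: List[Dict[str, object]] = []
    for raw in csv.DictReader(lines):
        row: Dict[str, object] = dict(raw)
        for col in embedding_columns:
            row[col] = parse_embedding(raw[col])
        rows.append(row)
    return rows


def read_csv_file(
    filepath: str, embedding_columns: Sequence[str]
) -> List[Dict[str, object]]:
    """Thin file-system wrapper around the parsing step."""
    with open(filepath, newline="") as f:
        return parse_csv_text(f, embedding_columns)


# A unit test can feed in-memory text instead of touching the disk.
sample = io.StringIO('prediction_id,embedding\n1,"[0.1, 0.2]"\n')
parsed = parse_csv_text(sample, ["embedding"])
```

Because only `read_csv_file` touches the disk, the string-to-vector conversion can be tested with plain in-memory text.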

src/phoenix/datasets/validation.py (outdated)

    from .types import Schema


    def validate_dataset_inputs(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
Contributor

Good opportunity for unit tests!

Contributor Author

Following up in #73
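
As a rough idea of what such unit tests could exercise, here is a simplified sketch of a validator that collects problems instead of raising on the first one. It is not the function from this PR: `SchemaSketch` and `validate_columns` are hypothetical names, the validator takes plain column names rather than a DataFrame so it needs no third-party packages, and it returns message strings rather than error objects. The optional `prediction_id` follows the PR description.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SchemaSketch:
    """Hypothetical, trimmed-down stand-in for the real Schema type."""

    prediction_id_column_name: Optional[str] = None  # optional per this PR
    embedding_vector_column_names: Tuple[str, ...] = ()


def validate_columns(columns: Sequence[str], schema: SchemaSketch) -> List[str]:
    """Return one message per problem; an empty list means the input is valid."""
    present = set(columns)
    problems: List[str] = []
    pid = schema.prediction_id_column_name
    if pid is not None and pid not in present:
        problems.append(
            f"The prediction id column {pid} is declared in the schema "
            "but is not found in the data."
        )
    for col in schema.embedding_vector_column_names:
        if col not in present:
            problems.append(
                f"The embedding vector column {col} is declared in the schema "
                "but is not found in the data."
            )
    return problems
```

Returning a list makes it straightforward to assert on the exact number of problems in a test, for example zero for a well-formed input and one for a single missing column.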

Francisco Castillo and others added 6 commits November 29, 2022 16:37
Co-authored-by: Mikyo King <mikyo@arize.com>
Francisco Castillo and others added 4 commits November 29, 2022 16:40
Co-authored-by: Mikyo King <mikyo@arize.com>
@fjcasti1 fjcasti1 merged commit 183c63a into main Nov 30, 2022
@fjcasti1 fjcasti1 deleted the read-embeddings-from-csv branch November 30, 2022 01:20
fjcasti1 pushed a commit that referenced this pull request Nov 30, 2022
* main:
  feat: Allow embeddings reads from csv file format (#71)
Development

Successfully merging this pull request may close these issues.

  • [metrics] CSV parsing for embeddings
  • Embeddings are read as full strings from csv files
2 participants